Abstract

Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground. Previous work has been limited in several ways: it often resorts to artificial settings where the problem is amplified, it disregards some (common) types of hallucinations, and it does not validate the adequacy of detection heuristics. In this paper, we set foundations for the study of NMT hallucinations. First, we work in a natural setting, i.e., in-domain data without artificial noise either in training or in inference. Next, we annotate a dataset of over 3.4k sentences indicating different kinds of critical errors and hallucinations. Then, we turn to detection methods and both revisit methods used previously and propose using glass-box uncertainty-based detectors. Overall, we show that for preventive settings, (i) previously used methods are largely inadequate, and (ii) sequence log-probability works best and performs on par with reference-based methods. Finally, we propose DeHallucinator, a simple method for alleviating hallucinations at test time which significantly reduces the hallucinatory rate.

ground. First, previous work used multiple and often overlapping definitions and categories of hallucinations, which makes it hard to draw connections between observations made in different works (Lee et al., 2018; Raunak et al., 2021; Zhou et al., 2021). Next, since hallucinations are extremely rare, previous work focused on settings in which the phenomenon is amplified, e.g. perturbing data either in training or at inference, or evaluating under domain shift (Lee et al., 2018; Raunak et al., 2021; Müller et al., 2020; Wang and Sennrich, 2020; Voita et al., 2021; Müller and Sennrich, 2021; Zhou et al., 2021). Critically, the analysis in these works mostly relied on the adequacy of the automatic hallucination detection methods they proposed. However, it is not immediate whether these methods translate well to unperturbed settings.

In this work, we set foundations for the study of NMT hallucinations. We take a step back from previous work and, instead of considering perturbed settings for which hallucinations are more frequent, we consider a natural scenario and face the actual problem of identifying a small fraction of hallucinations (a "needle") in a large number of translated sentences (a "haystack"). Then, we provide a rig-
Table 1: Examples of hallucination types.

Category: Oscillatory
  Source: Ist ein Kompromiss aufgrund des zugrundeliegenden Regelsystems unmöglich, so spricht man von Aporie.
  Reference: The case where, based on the pertinent system of regulations a compromise is not possible, is referred to as Aporia.
  Hallucination: Aporia is the name of aporia, which is the name of aporia.

Category: Strongly Detached
  Source: Tickets für Busse und die U-Bahn ist zu teuer, vor allem in Stockholm.
  Reference: Tickets for buses and the subway is too expensive, especially in Stockholm.
  Hallucination: The hotel is located in the centre of Stockholm, close to the train station.

Category: Fully Detached
  Source: Die Zimmer beziehen, die Fenster mit Aussicht öffnen, tief durchatmen, staunen.
  Reference: Head up to the rooms, open up the windows and savour the view, breathe deeply, marvel.
  Hallucination: The staff were very friendly and helpful.
2020a), which uses reference translations and thus cannot be used in most real-world applications. Surprisingly, its reference-free version COMET-QE (Rei et al., 2020b), which was shown to generally perform on par with COMET (Kocmi et al., 2021; Freitag et al., 2021), substantially fails to penalise the severity of hallucinations. Overall, methods targeting the phenomena are largely unfit, quality estimation systems fail, and sequence log-probability, i.e. a byproduct of generating a translation, turns out to be the best.

Apart from our analysis of detection methods, we propose DeHallucinator, a method for alleviating hallucinations at test time. At a high level, we first apply a lightweight hallucination detector and then, if a translation is flagged, we try to overwrite it with a better version. For this, we generate several MC-dropout hypotheses (Gal and Ghahramani, 2016), score them with some measure, and pick the highest-scoring translation as the final candidate. With this approach, the proportion of correct translations among the ones flagged by the detector increases from 33% to 85%, and the hallucinatory rate decreases threefold.

Overall, we show that (i) in preventive settings, previously proposed hallucination detectors are mostly inadequate; (ii) quality estimation techniques fail to distinguish hallucinations from less severe errors; (iii) sequence log-probability is the best hallucination detector and performs on par with reference-based COMET; and (iv) our DeHallucinator significantly alleviates hallucinations at test time.

Our annotated dataset along with the model, training data, and code are available.¹

¹ We will release these resources following deanonymization.

2 Taxonomy of Translation Pathologies

Choosing a good taxonomy is a compromise between simplicity (which minimizes annotation effort) and comprehensiveness. Thus, generic quality assessment taxonomies, such as MQM (Lommel et al., 2014), might be unfit or too complex when we focus only on critical errors and hallucinations. For hallucinations, in turn, previous work used multiple, often overlapping, definitions (Lee et al., 2018; Raunak et al., 2021; Zhou et al., 2021; Ji et al., 2022; Raunak et al., 2022). The taxonomy we build here is rather general: it covers categories considered previously (Lee et al., 2018; Raunak et al., 2021) and others not reported before.

2.1 Hallucinations

To distinguish hallucinations from other errors, we rely on the idea of detachment from the source sequence. From this perspective, other critical errors such as mistranslation of named entities are not considered hallucinations.

Oscillatory hallucinations. These are inadequate translations that contain erroneous repetitions of words and phrases.

Largely fluent hallucinations. These are largely fluent translations that are unrelated to the content of the source sequence. Previous work assumed they always bear no relation at all to the source content (Lee et al., 2018; Raunak et al., 2021). However, we find that a large proportion of fluent hallucinations partially support the source. Therefore, we also consider the severity of a hallucination and distinguish translations that are fully detached from those that are strongly (but not fully) detached. Note that oscillatory hallucinations can also be either fully or only partially detached, but since these hallucinations are less frequent, in what follows we do not split them by severity. We show examples of these hallucination types in Table 1.

2.2 Translation errors

Undergeneration. These are incomplete translations that do not cover part of the source content. This problem is often studied in isolation (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019;
Kumar and Sarawagi, 2019). Undergenerations are sometimes considered hallucinations (Lee et al., 2018), but we do not consider them so in our work.

Mistranslation of named entities. Appropriately translating named entities (e.g. names, dates, etc.) is also a known difficulty of NMT systems (Ugawa et al., 2018; Li et al., 2021; Hu et al., 2022). However, we do not consider it a hallucination, as it does not show detachment from the source but rather an incorrect attempt to translate part of its content.

Other errors. These are other incorrect translations that do not fit the categories above.

3 Hallucination Detection Methods

Reference-free methods. We use the state-of-the-art COMET-QE (Rei et al., 2020b) for its superior performance compared to other metrics (Mathur et al., 2020; Freitag et al., 2021; Kocmi et al., 2021).

Reference-based methods. Previous work used adjusted BLEU or CHRF2 scores of less than 1% as a standalone criterion (Lee et al., 2018; Raunak et al., 2021; Müller and Sennrich, 2021; Yan et al.,

not available. Thus, we use reference-based methods to estimate an upper bound for the performance of the other methods.

3.2 Hallucination Detection Heuristics

3.2.1 Previously Used Heuristics

Binary-score Heuristics. These heuristics were used by Raunak et al. (2021) to detect oscillatory and fully detached hallucinations. Given a corpus of source-translation pairs, a translation is flagged as a hallucination if it is in the set of 1% lowest-quality translations, and if:

• Top n-gram count (TNG). The count of the top repeated n-gram in the translation is greater than the count of the top repeated source n-gram by at least t (in their work,

3.2.2 Uncertainty-Based Heuristics

Now we describe the uncertainty measures we propose to use as hallucination detectors. Previously, these were used to improve quality assessments (Fomicheva et al., 2020; Zerva et al., 2021).

Sequence log-probability (Seq-Logprob). For a trained model P(y|x, θ) and a generated translation y, Seq-Logprob (i.e., model confidence) is the length-normalised sequence log-probability:

    (1/|y|) Σ_{k=1..|y|} log P(y_k | y_<k, x, θ)
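As a minimal sketch of this detector (not the paper's implementation), assuming the per-token log-probabilities have already been extracted from the decoder:

```python
import math

def seq_logprob(token_logprobs):
    """Length-normalised sequence log-probability (model confidence):
    the mean of log P(y_k | y_<k, x, theta) over the generated tokens."""
    if not token_logprobs:
        raise ValueError("empty translation")
    return sum(token_logprobs) / len(token_logprobs)

# A confident generation scores higher (closer to 0) than an uncertain one;
# translations in the long tail of low scores are candidate hallucinations.
confident = seq_logprob([math.log(0.9), math.log(0.8), math.log(0.95)])
uncertain = seq_logprob([math.log(0.2), math.log(0.1), math.log(0.3)])
assert confident > uncertain
```

Because these log-probabilities are a byproduct of decoding, the detector adds no extra inference cost.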
Dissimilarity of MC hypotheses (MC-DSim). This method measures how the original hypothesis y disagrees with hypotheses {h_1, ..., h_N} generated in stochastic passes. For the same source sentence, we generate these new hypotheses using Monte Carlo (MC) Dropout (Gal and Ghahramani, 2016). Then we evaluate the average similarity:

    (1/N) Σ_{i=1..N} SIM(h_i, y).    (3)

Different similarity measures can be used in place of SIM (e.g. METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2020), etc.). We follow previous work and use METEOR with N = 10 (Fomicheva et al., 2020; Zerva et al., 2021).

TokHal-Model. We evaluate the proportion of tokens that are predicted to be hallucinated and use this as a detection score.

3.4 Binary vs Continuous Scores

The methods above fall into two categories: binary-score and continuous-score heuristics. The former (only TNG and RT) output a value in {0, 1}, whereas the latter output a value in R and the prediction is made depending on a chosen threshold.

4 Experimental Setting

Model. We use Transformer base (Vaswani et al., 2017) from fairseq (Ott et al., 2019).

Data. We use the WMT'18 DE-EN news translation data excluding Paracrawl (Bojar et al., 2018) – 5.8M sentence pairs. We randomly choose 2/3 of the dataset for training and use the remaining 1/3 as a held-out set for analysis. For validation, we use the newstest2017 dataset.

Held-out Data Filtering. We are mainly interested in hallucinations produced for clean data. Since our held-out data comes from the WMT'18 training dataset and thus can be noisy, we filter it using Bicleaner, the filtering tool used in official releases of filtered ParaCrawl data (Sánchez-Cartagena et al., 2018; Ramírez-Sánchez et al., 2020; Kreutzer et al., 2022). Following previous work, we exclude examples with a score below 0.5 and end up with about 1.3M examples.

All details on preprocessing, hyperparameters and implementation can be found in Appendix A.

5 Hallucinations Dataset

To analyse the effectiveness of hallucination detection criteria, we (i) pick a subset of examples that are likely to be hallucinations, and (ii) obtain fine-grained annotations from professional translators.

5.1 Data for Annotation

fall below a chosen percentile for a given method, i.e. we consider long tails of the scores.³ For in-domain settings, previous work reported hallucinatory rates of 0.2–2%. However, these rates were either obtained on noisy and/or low-resource data or using weaker models. Therefore, in our cleaner in-domain setting with a stronger model, we expect the hallucination rate to lie in the lower end of the indicated range. Thus, we consider approximately 0.4% of the worst scores (which amounts to 5000 flagged translations for each criterion).⁴ From these, we sample 250 examples and add them to the dataset. In total, we end up with 3415 examples for annotation.⁵ We provide a high-level analysis of the dataset in Appendix C.

³ From now on, we refer to examples contained in a long tail of a method as "flagged" or "detected" by this method.
⁴ For continuous-score heuristics, the threshold for prediction is consistent with this percentile. For practical details, refer to Appendix I.
⁵ Note that a sample originally obtained from the worst scores or sampled from the long tail of a given method may belong to the long tail of another method.

5.2 Guidelines and Annotation

The annotation guidelines are developed according to the taxonomy defined in Section 2. More details
on data collection can be found in Appendix B.

Figure 1: Overall (left) and method-specific (right) statistics of human annotation results. Method-specific statistics show the percentages of correct translations (grey), translation errors (yellow) and hallucinations (red) among the examples flagged by each method.

Figure 2: Structure of the set of hallucinations flagged by each criterion. Horizontal bars show the percentage of hallucinations correctly flagged. Each vertical bar shows the size of the set of hallucinations that are (i) flagged by all the methods marked in the corresponding column and (ii) not flagged by any of the rest; only the top intersections are shown. Quality filters are shown with diamond marks, and detection heuristics are shown with circles. Reference-based methods are shaded.

6 Analysing Detection Criteria

In this section, we provide a comprehensive analysis of the performance of the heuristics and quality filters introduced in Section 3.

6.1 Quality Filters

Here we start by analysing reference-based methods, namely COMET and CHRF2, and then turn to the reference-free COMET-QE. Overall, our results show that reference information is helpful, while COMET-QE fails to penalise hallucinations.

Reference information helps detection. Figure 2 shows that leveraging reference information helps detect hallucinations: COMET detects more hallucinations than any of the other methods. As expected, lexical-based CHRF2 is significantly worse than neural-based COMET. In fact, it lags behind several heuristics (e.g., Seq-Logprob).

We also explore whether previously proposed methods for detecting hallucinations under domain shift are appropriate in our cleaner in-domain setting. Specifically, we follow Müller and Sennrich (2021) and consider translations with a CHRF2 score lower than 1%. Strikingly, for our clean setting, this approach is inadequate: in the entire held-out set of 1.3M examples, it flagged only 2 translations. This suggests that methods suitable for noisy settings (e.g., domain shift) might not be applicable in settings where models are less likely to hallucinate.

COMET-QE fails to penalise hallucinations. From Figure 1 (right) we see that, as expected, most of the translations flagged by COMET-QE are incorrect. However, the vast majority of them are not hallucinations. Indeed, Figure 2 shows that COMET-QE is one of the worst hallucination detectors, meaning that it fails to rank by the severity of a translation pathology. This supports the hypothesis made in previous work: since quality estimation models are mostly trained on data that lacks negative examples, they may be inadequate for evaluating poor translations (Takahashi et al., 2021; Sudoh et al., 2021).

Overall, among the considered quality filters, only COMET may be used as a hallucination detector. However, it is important to keep in mind that, in real-world applications, detecting hallucinations is only needed when references are not available, rendering reference-based methods not applicable.

6.2 Detection Heuristics

The results in Section 6.1 leave a relevant gap: where do we turn when references are not available? Is there any information, besides quality, that may help detect hallucinations? Our evidence suggests that in preventive settings, previously proposed heuristics are mostly inadequate, and uncertainty may be the answer to the questions we pose. In what follows, we present the results for the methods described in Section 3.2. For analysis of other detectors, refer to Appendix D.

Binary-score heuristics perform the worst. Table 2 shows the number of detected translations for Top n-gram count (TNG) and Repeated tar-
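The TNG criterion summarised in Section 3.2.1 can be sketched as below. This is a simplified re-implementation of the description given here, not the original code: whitespace tokenisation, the default n-gram order n and threshold t, and the omission of the accompanying 1% lowest-quality filter are all simplifying assumptions.

```python
from collections import Counter

def top_ngram_count(tokens, n=4):
    """Count of the most repeated n-gram in a token sequence (0 if too short)."""
    if len(tokens) < n:
        return 0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(counts.values())

def tng_flag(source, translation, n=4, t=2):
    """Flag a translation as oscillatory if its top repeated n-gram occurs
    at least t times more often than the top repeated source n-gram."""
    src_count = top_ngram_count(source.split(), n)
    tgt_count = top_ngram_count(translation.split(), n)
    return tgt_count - src_count >= t

# An oscillatory translation repeats the same phrase over and over:
assert tng_flag("so spricht man von Aporie",
                "the name of aporia is the name of aporia is the name of aporia")
assert not tng_flag("so spricht man von Aporie",
                    "this is referred to as aporia")
```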
Table 2: Translations flagged by the binary-score heuristics, by annotation category.

Heuristic      Correct   MT Errors   Hallucinations
                                     OSC    SD    FD
TNG            0         0           32     0     0
RT             18        19          2      1     7
All dataset    2048      1073        86     90    118

Anomalous attention is mostly not hallucination. Figures 1 and 2 show that the behaviors of Attn-to-EOS and Attn-ign-SRC are significantly different. First, Attn-to-EOS is not indicative of hallucinations. Indeed, attention patterns in which most attention mass is concentrated on the EOS token mostly correspond to correct translations. On the other hand, Attn-ign-SRC performs well and is second only to uncertainty-based Seq-Logprob. Such a difference in performance is very surprising: both methods are motivated by a common belief that if almost all the attention mass is concentrated on the source EOS token, a translation is likely to be a hallucination (Berard et al., 2019; Raunak et al., 2021). In fact, both methods were designed to identify this specific pattern. However, patterns identified with Attn-ign-SRC span from attention mass coming to various uninformative tokens (e.g., punctuation) to examples where attention is mostly diagonal (typically, these correspond to undergenerations). We show examples of such attention maps in Appendix E. Overall, the results highlight a disparity between what is believed to indicate hallucinations and what actually indicates them.

Note that while Attn-ign-SRC performs relatively well, it should be used with caution: attention-based heuristics rely on the assumption that attention patterns reflect model reasoning. This assumption is not reliable: although there is evidence that attention can play recognizable roles (Voita et al., 2018, 2019), a lot of work questions attention explainability (Wiegreffe and Pinter, 2019; Jain and Wallace, 2019; Serrano and Smith, 2019; Bastings and Filippova, 2020; Pruthi et al.,

multiple hypotheses for MC-DSim).

MC-DSim: hallucinations are mostly unstable. The intuition behind this heuristic is simple: when faced with a source sentence for which a good translation is not immediate, the set of hypotheses the model "keeps at hand" may be very diverse. Indeed, MC-DSim performs relatively well and identifies a good proportion of hallucinations (Figures 1, 2). Thus, most hallucinations are unstable: dissimilarity of MC hypotheses helps to identify them.

6.3 Combining Heuristics and Quality Filters

Intersecting a set of translations obtained via a heuristic with the bottom-scored translations according to a quality filter was introduced in Raunak et al. (2021). The motivation for this was to avoid incorrectly flagging good translations as hallucinations (e.g., without this intersection, RT flags good-quality paraphrases). Intuitively, this idea is very reasonable: hallucinations are indeed incorrect translations. However, implementing this in practice requires a good quality estimation model, specifically for ranking poor translations. Unfortunately, we showed in Section 6.1 that for this purpose, even the state-of-the-art COMET-QE is largely inadequate. This means that using such quality estimates may lead to filtering out a lot of hallucinations, which is not desirable in preventive settings. Our results show that this is exactly
the case: e.g., filtering with COMET-QE leads to losing nearly 80% of hallucinations detected by Seq-Logprob. All in all, such an intersection generally does more harm than good.

Figure 3: Distribution of the translations flagged by each method conditioned on each of the considered pathologies. (a) Binary-score heuristics; (b) continuous-score heuristics; (c) quality filters.

7 Analysing Hallucination Pathologies

In this section, we look at hallucination pathologies in isolation and show that the behavior of detection methods varies depending on the type of pathology. For example, for a given pathology, some methods may be specialised, whereas others may fail. For a similar analysis of other critical translation errors, refer to Appendix F.

Fully detached hallucinations. Figure 3 shows that these hallucinations are easily detected by several methods, e.g. Seq-Logprob, Attn-ign-SRC, COMET, and CHRF2. This is not surprising: intuitively, the most severe pathology should be the easiest to detect. However, COMET-QE fails to identify almost all these hallucinations. While COMET-based metrics are known to not penalise enough certain types of errors (e.g., discrepancies in numbers and named entities; see Amrhein and Sennrich (2022); Raunak et al. (2022)), such poor performance for completely inadequate translations is highly unexpected. This calls for further research on the behavior of quality estimation models.

On a more general note, let us recall that previous work suggested that fully detached hallucinations emerge as exact copies of references from the training data (Raunak et al., 2021). We validate this hypothesis and find the contrary: out of the 44 unique translations marked as fully detached from the source, only 4 are exact copies of references in the training data. Nevertheless, when looking at these sentences more closely, we see that they do contain large substrings that are seen frequently during training. Therefore, fully detached hallucinations are indeed likely to be traced back to the training data, but they emerge in non-trivial ways and not necessarily as exact copies. This can be seen as one more piece of evidence that, when dealing with memorisation in language models, it is necessary to consider not just full copies in the training data but also near-duplicates (Lee et al., 2022).

Strongly detached hallucinations. As expected, this pathology is harder to detect than fully detached hallucinations (Figure 3). However, the trends are largely similar: for example, COMET-QE fails again, and Seq-Logprob performs best and outperforms even reference-based COMET.

Oscillatory hallucinations. The method specifically developed to detect this hallucination type (Top n-gram count, TNG) performs worse than most of the other methods. Among the rest, COMET performs best. Interestingly, in contrast to previous observations, COMET-QE performs well, being on par with COMET.

8 DeHallucinator: Overwriting Hallucinations at Test Time

In previous sections, we saw that hallucinations are more unstable than other translations: for them, generated MC-dropout hypotheses tend to differ a lot. This motivates us to look more closely into these hypotheses: are any of these translations not hallucinations? Answering this question not only gives insight into the inner workings of hallucinating NMT models, but also leads to an interesting practical application – overwriting hallucinations at inference time. This is of utmost importance for production systems, where hallucinations have a deeply compromising effect on user trust.
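The detect-then-overwrite loop described in the introduction can be sketched as below. This is an illustrative skeleton, not the paper's implementation: the hook names (`detector_score`, `sample_mc_hypotheses`, `score`) and the toy unigram-overlap metric are assumptions standing in for Seq-Logprob-style detection, MC-dropout sampling, and a real similarity or quality measure such as METEOR or COMET.

```python
def dehallucinate(source, translation, detector_score, threshold,
                  sample_mc_hypotheses, score):
    """If `translation` is flagged by the detector (score below threshold),
    rerank MC-dropout hypotheses with `score` and return the best candidate;
    otherwise keep the original translation."""
    if detector_score(source, translation) >= threshold:
        return translation  # not flagged: keep the original
    candidates = [translation] + list(sample_mc_hypotheses(source))
    return max(candidates, key=lambda hyp: score(source, hyp))

# Toy stand-ins: monolingual unigram overlap; a real system would use a
# cross-lingual detector and scorer.
def overlap(src, hyp):
    s, h = set(src.split()), set(hyp.split())
    return len(s & h) / max(len(h), 1)

hypotheses = ["tickets for buses are too expensive",
              "the hotel is close to the station"]
fixed = dehallucinate(
    source="tickets for buses too expensive",
    translation="the hotel is close to the station",  # a detached output
    detector_score=overlap, threshold=0.5,
    sample_mc_hypotheses=lambda src: hypotheses,
    score=overlap)
assert fixed == "tickets for buses are too expensive"
```

The design choice worth noting is that the original translation stays in the candidate pool, so the overwrite can never do worse than the detector's own judgment of the alternatives.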
written with correct translations. This is surprising: in most cases, the model is not stuck in a hallucinatory mode and can generate good translations in a small vicinity of the model parameters. In this sense, most hallucinations result from "bad luck" during generation and not from a profound model defect. Note that fully detached hallucinations are the hardest to improve. This makes sense, as these are likely to be traced back to (near-)duplicates in the training data and, therefore, they do highlight model anomalies.

Figure 4: Our pipeline scheme along with results.
Limitations

We highlight three main limitations in our work. First, although the foundation of our proposed taxonomy for hallucinations rests on the idea of detachment from the source content, we do not evaluate it quantitatively. Indeed, we cannot point to whether the model that generated the hallucinations was indeed detached from the source sentence when generating them. Nevertheless, we can guarantee that the hallucinations in our dataset are translations that are detached from the source content according to professional translators. We consider the quantitative analysis of the detachment to be out of the scope of this paper. Still, it constitutes an interesting line for future research on understanding hallucinations in NMT that may be facilitated by the release of our code, model and annotated dataset.

Second, while this paper comprehensively studies the phenomenon of hallucinations in NMT for a high-resource language pair, experiments on more language pairs (including low-resource languages) are necessary to assess the broad validity of our claims. To keep our setup familiar to researchers and practitioners, we opted for a familiar language pair for which data is widely available. Moreover, our choice also facilitated the data collection process, as there is a large supply of professional translators for this language pair.

Lastly, instead of focusing on more recent NMT models that use large pretrained language models as their backbone, we focused on a Transformer base model. The reason for this choice is that we wanted to keep the setup simple, familiar, easy to reproduce, and computationally economical. Moreover, it was important for our work to have full control over the training and held-out data. Nevertheless, research on hallucinations in more recent and powerful NMT models is an exciting line of future work, and we hope our work spurs that research.

All code and annotated data samples will be publicly released following deanonymization.

References

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Chantal Amrhein and Rico Sennrich. 2022. Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Jasmijn Bastings and Katja Filippova. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 149–155, Online. Association for Computational Linguistics.

Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe's systems for the WMT19 machine translation robustness task. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 526–532, Florence, Italy. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Belgium, Brussels. Association for Computational Linguistics.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale.

Patrick Fernandes, António Farinhas, Ricardo Rei, José De Souza, Perez Ogayo, Graham Neubig, and Andre Martins. 2022. Quality-aware decoding for neural machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pages 1396–1412, Seattle, United States. Association for Computational Linguistics.

Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555.

Markus Freitag, David Grangier, Qijun Tan, and Bowen Liang. 2022. High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics. Transactions of the Association for Computational Linguistics, 10:811–825.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York,

pages 28–39, Vancouver. Association for Computational Linguistics.

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.
Association for Computational Linguistics. 841
784 USA. PMLR.
785 Junjie Hu, Hiroaki Hayashi, Kyunghyun Cho, and Gra- Aviral Kumar and Sunita Sarawagi. 2019. Calibration 842
786 ham Neubig. 2022. DEEP: DEnoising entity pre- of encoder decoder models for neural machine trans- 843
787 training for neural machine translation. In Proceed- lation. CoRR, abs/1903.00802. 844
788 ings of the 60th Annual Meeting of the Association
789 for Computational Linguistics (Volume 1: Long Pa- Ann Lee, Michael Auli, and Marc’Aurelio Ranzato. 845
790 pers), pages 1753–1766, Dublin, Ireland. Association 2021. Discriminative reranking for neural machine 846
791 for Computational Linguistics. translation. In Proceedings of the 59th Annual Meet- 847
ing of the Association for Computational Linguistics 848
792 Sarthak Jain and Byron C. Wallace. 2019. Attention is and the 11th International Joint Conference on Natu- 849
793 not Explanation. In Proceedings of the 2019 Con- ral Language Processing (Volume 1: Long Papers), 850
794 ference of the North American Chapter of the Asso- pages 7250–7264, Online. Association for Computa- 851
795 ciation for Computational Linguistics: Human Lan- tional Linguistics. 852
796 guage Technologies, Volume 1 (Long and Short Pa-
797 pers), pages 3543–3556, Minneapolis, Minnesota. Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fan- 853
798 Association for Computational Linguistics. njiang, and David Sussillo. 2018. Hallucinations in 854
neural machine translation. 855
799 Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu,
800 Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Katherine Lee, Daphne Ippolito, Andrew Nystrom, 856
801 Madotto, and Pascale Fung. 2022. Survey of halluci- Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, 857
802 nation in natural language generation. and Nicholas Carlini. 2022. Deduplicating training 858
data makes language models better. In Proceedings 859
803 Tom Kocmi, Christian Federmann, Roman Grund- of the 60th Annual Meeting of the Association for 860
804 kiewicz, Marcin Junczys-Dowmunt, Hitokazu Mat- Computational Linguistics (Volume 1: Long Papers), 861
805 sushita, and Arul Menezes. 2021. To ship or not to pages 8424–8445, Dublin, Ireland. Association for 862
806 ship: An extensive evaluation of automatic metrics Computational Linguistics. 863
807 for machine translation.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan 864
808 Philipp Koehn and Rebecca Knowles. 2017. Six chal- Ghazvininejad, Abdelrahman Mohamed, Omer Levy, 865
809 lenges for neural machine translation. In Proceedings Veselin Stoyanov, and Luke Zettlemoyer. 2020. 866
810 of the First Workshop on Neural Machine Translation, BART: Denoising sequence-to-sequence pre-training 867
10
868 for natural language generation, translation, and com- parallel data. In Proceedings of the 22nd Annual 924
869 prehension. In Proceedings of the 58th Annual Meet- Conference of the European Association for Machine 925
870 ing of the Association for Computational Linguistics, Translation, pages 291–298, Lisboa, Portugal. Euro- 926
871 pages 7871–7880, Online. Association for Computa- pean Association for Machine Translation. 927
872 tional Linguistics.
Vikas Raunak, Arul Menezes, and Marcin Junczys- 928
873 Panpan Li, Mengxiang Wang, and Jian Wang. 2021. Dowmunt. 2021. The curious case of hallucinations 929
874 Named entity translation method based on ma- in neural machine translation. In Proceedings of 930
875 chine translation lexicon. Neural Comput. Appl., the 2021 Conference of the North American Chap- 931
876 33(9):3977–3985. ter of the Association for Computational Linguistics: 932
Human Language Technologies, pages 1172–1183, 933
877 Arle Lommel, Aljoscha Burchardt, and Hans Uszkor- Online. Association for Computational Linguistics. 934
878 eit. 2014. Multidimensional quality metrics (mqm):
879 A framework for declaring and describing transla- Vikas Raunak, Matt Post, and Arul Menezes. 2022. 935
880 tion quality metrics. Tradumàtica: tecnologies de la Salted: A framework for salient long-tail translation 936
881 traducció, 0:455–463. error detection. 937
882 Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon 938
883 Ma, and Ondřej Bojar. 2020. Results of the WMT20 Lavie. 2020a. COMET: A neural framework for MT 939
884 metrics shared task. In Proceedings of the Fifth Con- evaluation. In Proceedings of the 2020 Conference 940
885 ference on Machine Translation, pages 688–725, On- on Empirical Methods in Natural Language Process- 941
886 line. Association for Computational Linguistics. ing (EMNLP), pages 2685–2702, Online. Association 942
for Computational Linguistics. 943
887 Mathias Müller, Annette Rios, and Rico Sennrich. 2020.
888 Domain robustness in neural machine translation. In Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon 944
889 Proceedings of the 14th Conference of the Associa- Lavie. 2020b. Unbabel’s participation in the WMT20 945
890 tion for Machine Translation in the Americas (Volume metrics shared task. In Proceedings of the Fifth Con- 946
891 1: Research Track), pages 151–164, Virtual. Associa- ference on Machine Translation, pages 911–920, On- 947
892 tion for Machine Translation in the Americas. line. Association for Computational Linguistics. 948
893 Mathias Müller and Rico Sennrich. 2021. Understand- Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio 949
894 ing the properties of minimum Bayes risk decoding Ortiz-Rojas, and Gema Ramírez-Sánchez. 2018. 950
895 in neural machine translation. In Proceedings of the Prompsit’s submission to wmt 2018 parallel cor- 951
896 59th Annual Meeting of the Association for Compu- pus filtering shared task. In Proceedings of the 952
897 tational Linguistics and the 11th International Joint Third Conference on Machine Translation, Volume 2: 953
898 Conference on Natural Language Processing (Vol- Shared Task Papers, Brussels, Belgium. Association 954
899 ume 1: Long Papers), pages 259–272, Online. Asso- for Computational Linguistics. 955
900 ciation for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 956
901 Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, 2016. Neural machine translation of rare words with 957
902 Sam Gross, Nathan Ng, David Grangier, and Michael subword units. In Proceedings of the 54th Annual 958
903 Auli. 2019. fairseq: A fast, extensible toolkit for Meeting of the Association for Computational Lin- 959
904 sequence modeling. In Proceedings of the 2019 Con- guistics (Volume 1: Long Papers), pages 1715–1725, 960
905 ference of the North American Chapter of the Associa- Berlin, Germany. Association for Computational Lin- 961
906 tion for Computational Linguistics (Demonstrations), guistics. 962
907 pages 48–53, Minneapolis, Minnesota. Association
908 for Computational Linguistics. Sofia Serrano and Noah A. Smith. 2019. Is attention in- 963
terpretable? In Proceedings of the 57th Annual Meet- 964
909 Matt Post. 2018. A call for clarity in reporting BLEU ing of the Association for Computational Linguistics, 965
910 scores. In Proceedings of the Third Conference on pages 2931–2951, Florence, Italy. Association for 966
911 Machine Translation: Research Papers, pages 186– Computational Linguistics. 967
912 191, Brussels, Belgium. Association for Computa-
913 tional Linguistics. Libin Shen, Anoop Sarkar, and Franz Josef Och. 2004. 968
Discriminative reranking for machine translation. 969
914 Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham In Proceedings of the Human Language Technol- 970
915 Neubig, and Zachary C. Lipton. 2020. Learning to ogy Conference of the North American Chapter 971
916 deceive with attention-based explanations. In Pro- of the Association for Computational Linguistics: 972
917 ceedings of the 58th Annual Meeting of the Asso- HLT-NAACL 2004, pages 177–184, Boston, Mas- 973
918 ciation for Computational Linguistics, pages 4782– sachusetts, USA. Association for Computational Lin- 974
919 4793, Online. Association for Computational Lin- guistics. 975
920 guistics.
Felix Stahlberg and Bill Byrne. 2019. On NMT search 976
921 Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, errors and model errors: Cat got your tongue? In 977
922 Marta Bañón, and Sergio Ortiz-Rojas. 2020. Bifixer Proceedings of the 2019 Conference on Empirical 978
923 and bicleaner: two open-source tools to clean your Methods in Natural Language Processing and the 979
11
980 9th International Joint Conference on Natural Lan- self-attention: Specialized heads do the heavy lift- 1036
981 guage Processing (EMNLP-IJCNLP), pages 3356– ing, the rest can be pruned. In Proceedings of the 1037
982 3362, Hong Kong, China. Association for Computa- 57th Annual Meeting of the Association for Computa- 1038
983 tional Linguistics. tional Linguistics, pages 5797–5808, Florence, Italy. 1039
Association for Computational Linguistics. 1040
984 Katsuhito Sudoh, Kosuke Takahashi, and Satoshi Naka-
985 mura. 2021. Is this translation error critical?: Chaojun Wang and Rico Sennrich. 2020. On exposure 1041
986 Classification-based human and automatic machine bias, hallucination and domain shift in neural ma- 1042
987 translation evaluation focusing on critical errors. In chine translation. In Proceedings of the 58th Annual 1043
988 Proceedings of the Workshop on Human Evaluation Meeting of the Association for Computational Lin- 1044
989 of NLP Systems (HumEval), pages 46–55, Online. guistics, pages 3544–3552, Online. Association for 1045
990 Association for Computational Linguistics. Computational Linguistics. 1046
991 Kosuke Takahashi, Yoichi Ishibashi, Katsuhito Sudoh, Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not 1047
992 and Satoshi Nakamura. 2021. Multilingual machine not explanation. In Proceedings of the 2019 Confer- 1048
993 translation evaluation metrics fine-tuned on pseudo- ence on Empirical Methods in Natural Language Pro- 1049
994 negative examples for wmt 2021 metrics task. In cessing and the 9th International Joint Conference 1050
995 Proceedings of the Sixth Conference on Machine on Natural Language Processing (EMNLP-IJCNLP), 1051
996 Translation, pages 1049–1052, Online. Association pages 11–20, Hong Kong, China. Association for 1052
997 for Computational Linguistics. Computational Linguistics. 1053
998 Chau Tran, Shruti Bhosale, James Cross, Philipp Koehn, Jianhao Yan, Fandong Meng, and Jie Zhou. 2022. Prob- 1054
999 Sergey Edunov, and Angela Fan. 2021. Facebook ing causes of hallucinations in neural machine trans- 1055
1000 AI’s WMT21 news translation task submission. In lations. 1056
1001 Proceedings of the Sixth Conference on Machine
1002 Translation, pages 205–215, Online. Association for Chrysoula Zerva, Daan van Stigt, Ricardo Rei, Ana C 1057
1003 Computational Linguistics. Farinha, Pedro Ramos, José G. C. de Souza, Taisiya 1058
Glushkova, Miguel Vera, Fabio Kepler, and André 1059
1004 Arata Ugawa, Akihiro Tamura, Takashi Ninomiya, Hi- F. T. Martins. 2021. IST-unbabel 2021 submission 1060
1005 roya Takamura, and Manabu Okumura. 2018. Neural for the quality estimation shared task. In Proceed- 1061
1006 machine translation incorporating named entity. In ings of the Sixth Conference on Machine Translation, 1062
1007 Proceedings of the 27th International Conference on pages 961–972, Online. Association for Computa- 1063
1008 Computational Linguistics, pages 3240–3250, Santa tional Linguistics. 1064
1009 Fe, New Mexico, USA. Association for Computa-
1010 tional Linguistics. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. 1065
Weinberger, and Yoav Artzi. 2020. Bertscore: Eval- 1066
1011 Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh uating text generation with bert. In International 1067
1012 Tomar, and Manaal Faruqui. 2019. Attention inter- Conference on Learning Representations. 1068
1013 pretability across nlp tasks.
Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, 1069
1014 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Francisco Guzmán, Luke Zettlemoyer, and Marjan 1070
1015 Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Ghazvininejad. 2021. Detecting hallucinated content 1071
1016 Kaiser, and Illia Polosukhin. 2017. Attention is all in conditional neural sequence generation. In Find- 1072
1017 you need. In Advances in Neural Information Pro- ings of the Association for Computational Linguis- 1073
1018 cessing Systems, volume 30. Curran Associates, Inc. tics: ACL-IJCNLP 2021, pages 1393–1404, Online. 1074
Association for Computational Linguistics. 1075
1019 Elena Voita, Rico Sennrich, and Ivan Titov. 2021. Ana-
1020 lyzing the source and target contributions to predic-
1021 tions in neural machine translation. In Proceedings
1022 of the 59th Annual Meeting of the Association for
1023 Computational Linguistics and the 11th International
1024 Joint Conference on Natural Language Processing
1025 (Volume 1: Long Papers), pages 1126–1140, Online.
1026 Association for Computational Linguistics.
12
A Experimental setup

Data preprocessing. We filter our data using language identification and simple length-heuristics described in Tran et al. (2021). We encode the data with byte-pair encoding (Sennrich et al., 2016) using the SentencePiece framework (Kudo and Richardson, 2018). We set the vocabulary size to 32k and compute joint encodings and vocabulary.

Model parameters. We follow the setup of the Transformer base model (Vaswani et al., 2017) (hidden size of 512, feedforward size of 2048, 6 encoder and 6 decoder layers, 8 attention heads). The model has approximately 77M parameters.

Metric   WMT2014   WMT2017   WMT2018
BLEU        31.1      32.6      38.9
COMET     0.3178    0.3257    0.3340

Table 3: Evaluation metrics for EN → DE for newstest sets for the WMT 2014, 2017 and 2018 campaigns.

 COR    UG    NE   OSC    SD    FD
0.62  0.73  0.42  0.86  0.45  0.89

Table 4: Fleiss's Kappa inter-annotator agreement scores (↑) for the different categories translators were prompted to identify.
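As a sanity check on the reported ~77M figure, the parameters of this configuration can be tallied by hand. The sketch below is a back-of-the-envelope estimate, not the paper's own accounting: it assumes separate source and target embedding matrices over the joint 32k vocabulary, an output projection tied to the target embeddings, and ignores parameter-free pieces such as sinusoidal positional encodings.

```python
def transformer_param_count(vocab=32000, d=512, d_ff=2048,
                            enc_layers=6, dec_layers=6):
    """Rough parameter tally for a Transformer base model."""
    attn = 4 * (d * d + d)                     # Q, K, V and output projections (+ biases)
    ffn = (d * d_ff + d_ff) + (d_ff * d + d)   # two linear layers of the feedforward block
    ln = 2 * d                                 # scale + bias of one LayerNorm
    enc_layer = attn + ffn + 2 * ln            # self-attention + FFN
    dec_layer = 2 * attn + ffn + 3 * ln        # self-attention + cross-attention + FFN
    embeddings = 2 * vocab * d                 # separate source/target embeddings (assumption)
    return embeddings + enc_layers * enc_layer + dec_layers * dec_layer

total = transformer_param_count()
print(f"{total / 1e6:.1f}M parameters")  # ≈ 76.9M, consistent with the reported ~77M
```

Under these assumptions the tally lands within rounding distance of 77M, which suggests the count includes untied source/target embeddings.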
We perform a rigorous and comprehensive manual annotation procedure with professional translators, in order to make sure that we are reliably analysing hallucinated translations. First, the translators were

7 BLEU+case.mixed+lang.XX+numrefs.1+smooth.exp+tok.13a+version.1.4.2
8 https://github.com/Unbabel/COMET
9 https://github.com/violet-zct/fairseq-detect-hallucination

Inter-annotator agreement. To determine the reliability of our annotations, we asked both our translators to annotate a set of 400 randomly sampled translations. For all hallucinatory categories but SD, the annotators achieved – according to Cohen's kappa coefficient (Cohen, 1960) – almost perfect agreement. For all other categories, moderate to substantial agreement was obtained. This
confirms that our data conforms very well to our instructions. The agreement scores for each category are displayed in Table 4.

C An Analysis of the Dataset

C.1 High-Level Overview of the Dataset

Figure 1 gives a structured overview of dataset statistics. First, we see that while we picked translations that are likely to be pathological, 60% of the dataset is occupied by correct translations. This highlights that, with the existing methods, reliably finding poor translations is still challenging. Next, note that most of the incorrect translations have translation errors that are not severe enough to be deemed hallucinations. This agrees with the view that hallucinations lie at the extreme end of the spectrum of MT pathologies (Raunak et al., 2021). Finally, the results of the annotation confirm that our data selection is very reasonable. Indeed, while previous work has reported hallucinatory rates of 0.2–2% in in-domain settings, we see that, for a reasonably sized collection of examples, our hallucination rate is substantially higher (9%). In Figure 1, we also show the method-specific statistics of human annotation results for each heuristic and quality filter. Unsurprisingly, the long tails of each method display different characteristics. For example, almost all translations flagged by Attn-to-EOS are correct, whereas the proportion of good translations flagged by COMET-QE or Seq-Logprob is rather small. We will analyse this further in Section 6. We also provide additional insights on how hallucinations in our data exhibit different characteristics from other, less severe MT errors in Appendix C.2.

C.2 Hallucinations vs MT Errors

We look separately at the sets of examples with hallucinations and other translation errors. Figure 5 shows the structures of these sets with respect to interactions of the different criteria.

We see that it is reasonable to consider hallucinations separately from other errors: the patterns in Figures 5a and 5b are substantially different. For example, for translation errors COMET-QE performs well, being on par with reference-aware COMET (Figure 5b). However, for hallucinations, it identifies less than half as many examples as not only COMET, but even simple uncertainty-based heuristics (Figure 5a). Furthermore, Figure 5 reveals a significant difference between the interactions of different criteria for hallucinations and for other translations: hallucinations in our data are most often flagged by multiple criteria simultaneously (e.g. Seq-Logprob flags the majority of hallucinations detected with Attn-ign-SRC and MC-DSim), whereas most MT errors and correct translations are flagged by a single method. This difference in patterns supports our choice of taxonomy: the properties of hallucinations differ substantially from those of other, less severe errors.

D Analysis of TokHal-Model

TokHal-Model is unfit for natural hallucinations. Let us recall that the model used for the TokHal-Model scores was trained to identify replaced tokens in corrupted translations that are fluent and do not differ much from the original ones (Zhou et al., 2021). This means that during training, the model was unlikely to observe highly pathological translations that reflect the types of hallucinations produced by actual NMT systems. This raises several concerns when using this model in our setting. For example, it might incorrectly flag adequate tokens such as synonyms or paraphrases. What is more, since severely flawed examples are mostly out of distribution for the model, labels predicted for such translations may be unreasonable.

The results confirm our concerns: Figures 1 and 5a show that (i) the vast majority of translations flagged by TokHal-Model are correct and (ii) it is one of the worst hallucination detectors. On a separate note, in Section 7 we will see that it is reasonable for identifying mistranslations of named entities, which supports our hypothesis that this method is more likely to detect less severe translation errors rather than hallucinations.

E Patterns of attention maps for translations flagged with attention-based heuristics

The attention maps are shown in Figure 6. While both Attn-to-EOS and Attn-ign-SRC were designed to identify translations for which almost all the attention mass is concentrated on the EOS token, the patterns identified with Attn-ign-SRC are more diverse. They range from attention mass assigned to various uninformative tokens (e.g., punctuation and other tokens, as in Figure 6b) to examples such as Figure 6c, where attention is mostly diagonal (typically, these correspond to undergenerations; we continue this discussion in the next section).
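As a concrete illustration of what such a heuristic looks at, here is a minimal sketch of an Attn-to-EOS-style score computed over a toy attention matrix. The aggregation (mean attention mass on the source EOS position) and the threshold are illustrative assumptions; the paper's precise definitions may differ.

```python
def attn_to_eos_score(attn_rows, eos_index=-1):
    """Average attention mass placed on the source EOS token across target positions.

    attn_rows: one attention distribution over source tokens per target token.
    A high score suggests the decoder is largely ignoring the source content.
    """
    return sum(row[eos_index] for row in attn_rows) / len(attn_rows)

# Toy 3-target x 4-source attention matrix; most mass sits on EOS (last column).
attn = [
    [0.05, 0.05, 0.10, 0.80],
    [0.10, 0.05, 0.05, 0.80],
    [0.05, 0.10, 0.05, 0.80],
]
score = attn_to_eos_score(attn)
flagged = score >= 0.5  # hypothetical threshold
```

A diagonal attention map, by contrast, would yield a low score here, which matches the observation that Attn-ign-SRC catches patterns this EOS-focused score would miss.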
Figure 5: Structure of the sets of translations flagged by the considered methods: (a) hallucinations, (b) MT errors, (c) correct translations. Horizontal bars show the proportion of examples flagged by each method among all translations of the considered category. Each vertical bar shows the size of the set of translations that are (i) flagged by all the methods marked in the corresponding column and (ii) not flagged by any of the rest; only the top intersections are shown. Quality filters are shown with diamond marks, and detection heuristics with circles. Methods requiring reference translations are shaded.

Figure 6: (a) Correct translation, flagged by Attn-to-EOS. (b) Hallucination, flagged by Attn-ign-SRC. (c) Undergeneration, flagged by Attn-ign-SRC.
Figure 7: Our pipeline scheme along with results when we use Seq-Logprob as both the detector and the scoring measure.

s can be used to build a binary decision rule: for a given source x, translation ŷ and reference y, a translation is flagged by the detector if and only if s(x, ŷ, y) ≤ γ. Naturally, s need not be a function of all of x, ŷ and y (e.g. COMET-QE is only a function of x and ŷ).

Be careful when relying on references. Using reference information is helpful for detecting hallucinations (see Section 6.1), and while it may not be used to detect hallucinations on-the-fly, it may still prove useful for analysis. We have found that high-quality parallel data is critical for adequate application of these methods: very low scores might not only be attributed to poor translations, but also to reference mismatches. Indeed, preliminary experiments highlighted this worrying trend, which motivated us to clean the held-out set (Section 4). Thus, if using reference information to detect hallucinations, make sure to thoroughly clean your parallel data.
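The thresholded decision rule described above can be written directly in code. This is a minimal sketch with a toy scoring function standing in for a real detector such as Seq-Logprob; make_detector, toy_score, and the γ value are illustrative assumptions, not the paper's implementation.

```python
def make_detector(score_fn, gamma):
    """Turn a scoring function s into a binary decision rule:
    a translation is flagged iff s(x, y_hat, y) <= gamma."""
    def is_flagged(x, y_hat, y=None):
        return score_fn(x, y_hat, y) <= gamma
    return is_flagged

# Toy stand-in for a score such as Seq-Logprob (illustrative only):
# lower values mean the model was less confident in its own output.
def toy_score(x, y_hat, y=None):
    return -2.0 if len(y_hat.split()) < 3 else -0.1

detect = make_detector(toy_score, gamma=-1.0)
```

As in the text, the score need not use all of x, ŷ and y: a quality-estimation score like COMET-QE would simply ignore the reference argument.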
OVERWRITING FULLY DETACHED HALLUCINATIONS

Source:                  Handys, die bis auf Wäschewaschen und Staubsaugen scheinbar alles können.
Reference:               Mobile phones that can practically do everything except clean the laundry and vacuum clean.
Original hypothesis:     The staff were very friendly and helpful.
Overwritten hypothesis:  Mobile phones that seem to be able to do everything except on laundry and dustproofing.

Source:                  In unserem 2 Personen Van mit Dusche/WC war ausreichend Platz für uns beide.
Reference:               The space in our 2 person van with shower/toilet was enough for 2 people.
Original hypothesis:     The staff were very friendly and helpful. The room was clean and comfortable.
Overwritten hypothesis:  In our 2 person van with shower/WC there was enough room for us both.

Source:                  Ich freue mich über jeden Tag, wo es mir so gut geht!
Reference:               I am pleased about each day, where I am so well!
Original hypothesis:     I look forward to seeing you every day!
Overwritten hypothesis:  I'm very happy with every day I'm doing so well!

Table 5: Examples of hallucinations of each type that have been overwritten with correct translations.