
Looking for a Needle in a Haystack:

A Comprehensive Study of Hallucinations in Neural Machine Translation

Anonymous EACL submission

Abstract

Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground. Previous work has been limited in several ways: it often resorts to artificial settings where the problem is amplified, it disregards some (common) types of hallucinations, and it does not validate the adequacy of its detection heuristics. In this paper, we set foundations for the study of NMT hallucinations. First, we work in a natural setting, i.e., in-domain data without artificial noise either in training or in inference. Next, we annotate a dataset of over 3.4k sentences indicating different kinds of critical errors and hallucinations. Then, we turn to detection methods: we both revisit methods used previously and propose using glass-box uncertainty-based detectors. Overall, we show that for preventive settings, (i) previously used methods are largely inadequate, and (ii) sequence log-probability works best and performs on par with reference-based methods. Finally, we propose DeHallucinator, a simple method for alleviating hallucinations at test time which significantly reduces the hallucinatory rate.

1 Introduction

Neural machine translation (NMT) is becoming increasingly accurate (Vaswani et al., 2017; Akhbardeh et al., 2021), particularly for high-resource language pairs where parallel data is abundant. However, even the best systems available today may generate hallucinations: extremely pathological translations that contain content unfaithful to the source sequence. Critically, a tiny fraction of these mistakes is all it takes to compromise user trust in, or the safe deployment of, NMT models in production.

Unfortunately, although the problem of hallucinations has received some attention, research on this highly pathological phenomenon lacks solid ground. First, previous work used multiple and often overlapping definitions and categories of hallucinations, which makes it hard to draw connections between observations made in different works (Lee et al., 2018; Raunak et al., 2021; Zhou et al., 2021). Next, since hallucinations are extremely rare, previous work focused on settings in which the phenomenon is amplified, e.g. perturbing data either in training or at inference, or evaluating under domain shift (Lee et al., 2018; Raunak et al., 2021; Müller et al., 2020; Wang and Sennrich, 2020; Voita et al., 2021; Müller and Sennrich, 2021; Zhou et al., 2021). Critically, the analysis in these works mostly relied on the adequacy of the automatic hallucination detection methods they proposed; however, it is not immediately clear whether these methods transfer well to unperturbed settings.

In this work, we set foundations for the study of NMT hallucinations. We take a step back from previous work and, instead of considering perturbed settings in which hallucinations are more frequent, we consider a natural scenario and face the actual problem of identifying a small fraction of hallucinations (a "needle") in a large number of translated sentences (a "haystack"). Then, we provide a rigorous comparison among hallucination detection methods. Apart from analysing those proposed in previous work (e.g., heuristics based on anomalous encoder-decoder attention), we also propose to use simple model uncertainty measures as detectors. For each of these methods, we select examples marked as hallucinations, put them together, and gather human annotations. As a result, we introduce a corpus of 3415 structured annotations for different NMT pathologies and hallucinations. We use this corpus for analysis and show that, in preventive settings where high recall is desirable, previously proposed methods are mostly inadequate, and filtering according to standard sequence log-probability performs the best.
Oscillatory
  Source: Ist ein Kompromiss aufgrund des zugrundeliegenden Regelsystems unmöglich, so spricht man von Aporie.
  Reference: The case where, based on the pertinent system of regulations a compromise is not possible, is referred to as Aporia.
  Hallucination: Aporia is the name of aporia, which is the name of aporia.

Strongly Detached
  Source: Tickets für Busse und die U-Bahn ist zu teuer, vor allem in Stockholm.
  Reference: Tickets for buses and the subway is too expensive, especially in Stockholm.
  Hallucination: The hotel is located in the centre of Stockholm, close to the train station.

Fully Detached
  Source: Die Zimmer beziehen, die Fenster mit Aussicht öffnen, tief durchatmen, staunen.
  Reference: Head up to the rooms, open up the windows and savour the view, breathe deeply, marvel.
  Hallucination: The staff were very friendly and helpful.

Table 1: Examples of hallucination types. Hallucinated content is shown shaded.

In fact, it performs on par with the state-of-the-art COMET (Rei et al., 2020a), which uses a reference translation and thus cannot be used in most real-world applications. Surprisingly, its reference-free version COMET-QE (Rei et al., 2020b), which was shown to generally perform on par with COMET (Kocmi et al., 2021; Freitag et al., 2021), substantially fails to penalise the severity of hallucinations. Overall, methods targeting the phenomenon are largely unfit, quality estimation systems fail, and sequence log-probability, i.e. a byproduct of generating a translation, turns out to be the best.

Apart from our analysis of detection methods, we propose DeHallucinator, a method for alleviating hallucinations at test time. At a high level, we first apply a lightweight hallucination detector and then, if a translation is flagged, we try to overwrite it with a better version. For this, we generate several MC-dropout hypotheses (Gal and Ghahramani, 2016), score them with some measure, and pick the highest-scoring translation as the final candidate. With this approach, the proportion of correct translations among the ones flagged by the detector increases from 33% to 85%, and the hallucinatory rate decreases threefold.

Overall, we show that (i) in preventive settings, previously proposed hallucination detectors are mostly inadequate; (ii) quality estimation techniques fail to distinguish hallucinations from less severe errors; (iii) sequence log-probability is the best hallucination detector and performs on par with reference-based COMET; and (iv) our DeHallucinator significantly alleviates hallucinations at test time.

Our annotated dataset along with the model, training data, and code are available.[1]

[1] We will release these resources following deanonymization.

2 Taxonomy of Translation Pathologies

Choosing a good taxonomy is a compromise between simplicity (which minimizes annotation effort) and comprehensiveness. Thus, generic quality assessment taxonomies, such as MQM (Lommel et al., 2014), might be unfit or too complex when we focus only on critical errors and hallucinations. For hallucinations, in turn, previous work used multiple, often overlapping, definitions (Lee et al., 2018; Raunak et al., 2021; Zhou et al., 2021; Ji et al., 2022; Raunak et al., 2022). The taxonomy we build here is rather general: it covers categories considered previously (Lee et al., 2018; Raunak et al., 2021) and others not reported before.

2.1 Hallucinations

To distinguish hallucinations from other errors, we rely on the idea of detachment from the source sequence. From this perspective, other critical errors such as mistranslation of named entities are not considered hallucinations.

Oscillatory hallucinations. These are inadequate translations that contain erroneous repetitions of words and phrases.

Largely fluent hallucinations. These are largely fluent translations that are unrelated to the content of the source sequence. Previous work assumed they always bear no relation at all to the source content (Lee et al., 2018; Raunak et al., 2021). However, we find that a large proportion of fluent hallucinations partially support the source. Therefore, we also consider the severity of a hallucination and distinguish translations that are fully detached from the source from those that are strongly (but not fully) detached. Note that oscillatory hallucinations can also be either fully or only partially detached, but since these hallucinations are less frequent, in what follows we do not split them by severity. We show examples of these hallucination types in Table 1.

2.2 Translation errors

Undergeneration. These are incomplete translations that do not cover part of the source content. This problem is often studied in isolation (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019; Kumar and Sarawagi, 2019). Undergenerations are sometimes considered as hallucinations (Lee et al., 2018), but we do not treat them as such in our work.
Mistranslation of named entities. Appropriately translating named entities (e.g. names, dates, etc.) is also a known difficulty for NMT systems (Ugawa et al., 2018; Li et al., 2021; Hu et al., 2022). However, we do not consider it a hallucination, as it does not show detachment from the source but rather an incorrect attempt to translate part of its content.

Other errors. These are other incorrect translations that do not fit the categories above.

3 Hallucination Detection Methods

Approaches to hallucination detection have the general form

    [low quality] ∩ [heuristics],    (1)

i.e. they flag low-quality translations that also satisfy additional constraints. Previous work either relied only on quality filtering (Lee et al., 2018; Raunak et al., 2021; Müller and Sennrich, 2021), or only on heuristics with no constraints on quality (Berard et al., 2019), or on a combination of the two (Raunak et al., 2021). We stick to this general form and consider different quality filters and heuristics.

3.1 Quality Filters

Quality filters come in two forms: reference-free and reference-based filters. The latter rely on reference translations, while the former do not.

Reference-free methods. We use the state-of-the-art COMET-QE (Rei et al., 2020b) for its superior performance compared to other metrics (Mathur et al., 2020; Freitag et al., 2021; Kocmi et al., 2021).

Reference-based methods. Previous work used adjusted BLEU or CHRF2 scores of less than 1% as a standalone criterion (Lee et al., 2018; Raunak et al., 2021; Müller and Sennrich, 2021; Yan et al., 2022). In this work, we analyse CHRF2 because it is more suitable for sentence-level evaluation. In addition to this lexical metric, we also consider the neural COMET (Rei et al., 2020a), a state-of-the-art reference-based metric (Kocmi et al., 2021). Note that in real-world applications, detecting hallucinations is only needed when references are not available. Thus, we use reference-based methods to estimate an upper bound on the performance of the other methods.

3.2 Hallucination Detection Heuristics

3.2.1 Previously Used Heuristics

Binary-score heuristics. These heuristics were used by Raunak et al. (2021) to detect oscillatory and fully detached hallucinations. Given a corpus of source-translation pairs, a translation is flagged as a hallucination if it is in the set of the 1% lowest-quality translations and if (see the sketch at the end of this subsection):

- Top n-gram count (TNG). The count of the most repeated n-gram in the translation is greater than the count of the most repeated source n-gram by at least t (in their work, n = 4 and t = 2);
- Repeated targets (RT). The translation is repeated for multiple unique source sentences.

Anomalous encoder-decoder attention. Attention patterns[2] in which most attention mass is concentrated on the source EOS token are often associated with a model ignoring the source and generating a hallucinatory translation (Lee et al., 2018; Berard et al., 2019; Raunak et al., 2021). We consider two different criteria targeted at finding this pattern (both included in the sketch below):

- Attn-to-EOS: the proportion of attention paid to the source EOS token;
- Attn-ign-SRC: the proportion of source words with a total incoming attention mass lower than 0.2. This was used as a data filtering criterion in Berard et al. (2019).

[2] These patterns are computed from the average of the cross-attention heads of the decoder's last layer.
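To make these criteria concrete, the following is a minimal sketch of all four heuristics under simple assumptions: whitespace tokenization for TNG and RT, and attn being the decoder's last-layer cross-attention averaged over heads, with shape (target length, source length), rows summing to 1, and the last source position taken to be the EOS token. Function names are ours; this illustrates the criteria rather than reproducing the original implementations.

    from collections import Counter
    import numpy as np

    def max_ngram_count(tokens, n=4):
        """Count of the most repeated n-gram in a token sequence (0 if too short)."""
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return max(Counter(ngrams).values()) if ngrams else 0

    def tng_flag(source, translation, n=4, t=2):
        """TNG: the translation's top n-gram count exceeds the source's by at least t."""
        return max_ngram_count(translation.split(), n) >= max_ngram_count(source.split(), n) + t

    def rt_flags(sources, translations):
        """RT: flag translations that are repeated for multiple unique source sentences."""
        sources_per_hyp = {}
        for src, hyp in zip(sources, translations):
            sources_per_hyp.setdefault(hyp, set()).add(src)
        return [len(sources_per_hyp[hyp]) > 1 for hyp in translations]

    def attn_to_eos(attn):
        """Proportion of the total attention mass paid to the source EOS token."""
        return float(attn[:, -1].sum() / attn.sum())

    def attn_ign_src(attn, threshold=0.2):
        """Proportion of source tokens whose total incoming attention mass is below the threshold."""
        incoming = attn.sum(axis=0)  # total mass received by each source position
        return float((incoming < threshold).mean())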

3.2.2 Uncertainty-Based Heuristics

We now describe the uncertainty measures we propose to use as hallucination detectors. Previously, these were used to improve quality assessments (Fomicheva et al., 2020; Zerva et al., 2021).

Sequence log-probability (Seq-Logprob). For a trained model P(y | x, θ) and a generated translation y of length L, Seq-Logprob (i.e., model confidence) is the length-normalised sequence log-probability:

    (1/L) Σ_{k=1}^{L} log P(y_k | y_{<k}, x, θ).    (2)

We hypothesise that when hallucinating, a model is not confident.
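For illustration, the sketch below computes equation (2) with a Hugging Face seq2seq model via teacher forcing. This is our own scoring loop with an off-the-shelf DE-EN model as a stand-in, not the paper's fairseq pipeline, where the same score is returned by beam search as a by-product.

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "Helsinki-NLP/opus-mt-de-en"  # stand-in DE-EN model for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

    @torch.no_grad()
    def seq_logprob(source: str, translation: str) -> float:
        """Length-normalised log-probability of `translation` given `source` (equation 2)."""
        enc = tokenizer(source, return_tensors="pt")
        labels = tokenizer(text_target=translation, return_tensors="pt").input_ids
        logits = model(**enc, labels=labels).logits  # (1, target_len, vocab)
        logprobs = logits.log_softmax(dim=-1)
        token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return token_logprobs.mean().item()  # mean of log P(y_k | y_<k, x)

Low scores mark candidate hallucinations; a flagging threshold can then be chosen on the held-out score distribution (cf. Section 5.1).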
Dissimilarity of MC hypotheses (MC-DSim). This method measures how much the original hypothesis y disagrees with hypotheses {h_1, ..., h_N} generated in stochastic forward passes. For the same source sentence, we generate these new hypotheses using Monte Carlo (MC) dropout (Gal and Ghahramani, 2016). Then we evaluate the average similarity:

    (1/N) Σ_{i=1}^{N} SIM(h_i, y).    (3)

Different similarity measures can be used in place of SIM (e.g. METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2020), etc.). We follow previous work and use METEOR with N = 10 (Fomicheva et al., 2020; Zerva et al., 2021).
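A minimal sketch of equation (3), assuming a translate_with_dropout callable that decodes with dropout layers kept active at inference time (MC dropout); it uses NLTK's METEOR as SIM and is our illustration rather than the exact pipeline.

    from nltk.translate.meteor_score import meteor_score  # needs nltk's wordnet data

    def mc_dsim(source, translation, translate_with_dropout, n=10):
        """Average METEOR similarity between n MC-dropout hypotheses and `translation`.
        Low values indicate unstable, and thus suspicious, translations."""
        hypotheses = [translate_with_dropout(source) for _ in range(n)]
        return sum(meteor_score([h.split()], translation.split()) for h in hypotheses) / n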


3.3 Trained Hallucination Detection Model

An exception to the framework defined in equation (1) is the work of Zhou et al. (2021), who learn to detect token-level hallucinations. Specifically, the authors create synthetic data by randomly corrupting some tokens in a translation and reconstructing them with the BART model (Lewis et al., 2020). Then, they fine-tune a pretrained language model to identify the replaced tokens.

TokHal-Model. We evaluate the proportion of tokens that are predicted to be hallucinated and use this as a detection score.
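Sketch of the TokHal-Model score, with token_hallucination_labels standing in for the token-level classifier of Zhou et al. (2021), which we do not reimplement here.

    def tokhal_score(source: str, translation: str, token_hallucination_labels) -> float:
        """Proportion of translation tokens predicted to be hallucinated."""
        labels = token_hallucination_labels(source, translation)  # e.g., a list of 0/1 flags
        return sum(labels) / max(len(labels), 1)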
3.4 Binary vs Continuous Scores

The methods above fall into two categories: binary-score and continuous-score heuristics. The former (only TNG and RT) output a value in {0, 1}, whereas the latter output a value in ℝ, and a prediction is made depending on a chosen threshold.

4 Experimental Setting

Model. We use Transformer base (Vaswani et al., 2017) from fairseq (Ott et al., 2019).

Data. We use the WMT'18 DE-EN news translation data excluding Paracrawl (Bojar et al., 2018) – 5.8M sentence pairs. We randomly choose 2/3 of the dataset for training and use the remaining 1/3 as a held-out set for analysis. For validation, we use the newstest2017 dataset.

Held-out Data Filtering. We are mainly interested in hallucinations produced for clean data. Since our held-out data comes from the WMT'18 training dataset and thus can be noisy, we filter it using Bicleaner, the filtering tool used in official releases of filtered ParaCrawl data (Sánchez-Cartagena et al., 2018; Ramírez-Sánchez et al., 2020; Kreutzer et al., 2022). Following previous work, we exclude examples with a score below 0.5 and end up with about 1.3M examples.

All details on preprocessing, hyperparameters and implementation can be found in Appendix A.

5 Hallucinations Dataset

To analyse the effectiveness of hallucination detection criteria, we (i) pick a subset of examples that are likely to be hallucinations, and (ii) obtain fine-grained annotations from professional translators.

5.1 Data for Annotation

Our data selection is motivated by two goals: (i) find as many hallucinatory translations as possible, to analyse hallucinations; and (ii) pick some translations from the long tail of hallucination detection predictions, to analyse the behaviour of these detection methods. Thus, we first pick the 250 worst-scored samples for each heuristic and quality filter (including binary assignments obtained through TNG and RT). Next, we turn to a broader set of samples and consider translations whose scores fall below a chosen percentile for a given method, i.e. we consider the long tails of the scores.[3] For in-domain settings, previous work reported hallucinatory rates of 0.2–2%. However, these rates were either obtained on noisy and/or low-resource data or using weaker models. Therefore, in our cleaner in-domain setting with a stronger model, we expect the hallucination rate to lie at the lower end of the indicated range. Thus, we consider approximately the 0.4% worst scores (which amounts to 5000 flagged translations for each criterion).[4] From these, we sample 250 examples and add them to the dataset. In total, we end up with 3415 examples for annotation.[5] We provide a high-level analysis of the dataset in Appendix C.

[3] From now on, we refer to examples contained in the long tail of a method as "flagged" or "detected" by this method.
[4] For continuous-score heuristics, the threshold for prediction is consistent with this percentile. For practical details, refer to Appendix I.
[5] Note that a sample originally obtained from the worst scores or sampled from the long tail of a given method may also belong to the long tail of another method.
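The long-tail selection above reduces to a simple percentile cut; the sketch below is ours and assumes scores is an array of per-translation detector scores where lower means more suspicious.

    import numpy as np

    def flag_long_tail(scores, fraction=0.004):
        """Flag roughly the `fraction` worst-scored translations (about 0.4% here)."""
        threshold = np.quantile(scores, fraction)
        return scores <= threshold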

5.2 Guidelines and Annotation

The annotation guidelines are developed according to the taxonomy defined in Section 2. More details on data collection can be found in Appendix B.
Figure 1: Overall (left) and method-specific (right) statistics of human annotation results. Method-specific statistics show the percentages of correct translations (grey), translation errors (yellow) and hallucinations (red) among the examples flagged by each method.

Figure 2: Structure of the set of hallucinations flagged by each criterion. Horizontal bars show the percentage of hallucinations correctly flagged. Each vertical bar shows the size of the set of hallucinations that are (i) flagged by all the methods marked in the corresponding column and (ii) not flagged by any of the rest; only the top intersections are shown. Quality filters are shown with diamond marks, and detection heuristics with circles. Reference-based methods are shaded.

6 Analysing Detection Criteria

In this section, we provide a comprehensive analysis of the performance of the heuristics and quality filters introduced in Section 3.

6.1 Quality Filters

We start by analysing the reference-based methods, namely COMET and CHRF2, and then turn to the reference-free COMET-QE. Overall, our results show that reference information is helpful, while COMET-QE fails to penalise hallucinations.

Reference information helps detection. Figure 2 shows that leveraging reference information helps in detecting hallucinations: COMET detects more hallucinations than any of the other methods. As expected, the lexical CHRF2 is significantly worse than the neural COMET. In fact, it lags behind several heuristics (e.g., Seq-Logprob).

We also explore whether previously proposed methods for detecting hallucinations under domain shift are appropriate in our cleaner in-domain setting. Specifically, we follow Müller and Sennrich (2021) and consider translations with a CHRF2 score lower than 1%. Strikingly, for our clean setting, this approach is inadequate: in the entire held-out set of 1.3M examples, it flagged only 2 translations. This suggests that methods suitable for noisy settings (e.g., domain shift) might not be applicable in settings where models are less likely to hallucinate.
364 COMET-QE fails to penalise hallucinations.
other detectors, refer to Appendix D. 393
365 From Figure 1 (right) we see that, as expected,
366 most of the translations flagged by COMET-QE Binary-score heuristics perform the worst. Ta- 394
367 are incorrect. However, the vast majority of them ble 2 shows the number of detected translations 395
368 are not hallucinations. Indeed, Figure 2 shows for Top n-gram count (TNG) and Repeated tar- 396

5
Heuristic      Correct   MT Errors   OSC   SD   FD
TNG                  0           0    32    0    0
RT                  18          19     2    1    7
All dataset       2048        1073    86   90  118

Table 2: Translations flagged by TNG and RT (hallucinations split into OSC: oscillatory, SD: strongly detached, FD: fully detached).

These heuristics are targeted at identifying oscillatory and fully detached hallucinations, respectively. We see that while TNG obtains perfect precision, it fails to identify more than half of the oscillatory translations. RT, in turn, performs poorly across the board: only a few hallucinations are detected, and a significant proportion of the flagged translations turn out to be correct. Moreover, Figure 2 shows that even if we join the sets of translations detected by TNG and RT (as done in Raunak et al. (2021)), we altogether get fewer hallucinations than with almost any other considered method. Thus, in preventive settings, these methods are highly inadequate.

Anomalous attention is mostly not hallucination. Figures 1 and 2 show that the behaviors of Attn-to-EOS and Attn-ign-SRC are significantly different. First, Attn-to-EOS is not indicative of hallucinations. Indeed, attention patterns in which most attention mass is concentrated on the EOS token mostly correspond to correct translations. On the other hand, Attn-ign-SRC performs well and is second only to uncertainty-based Seq-Logprob. Such a difference in performance is very surprising: both methods are motivated by a common belief that if almost all the attention mass is concentrated on the source EOS token, a translation is likely to be a hallucination (Berard et al., 2019; Raunak et al., 2021). In fact, both methods were designed to identify this specific pattern. However, the patterns identified with Attn-ign-SRC span from attention mass going to various uninformative tokens (e.g., punctuation) to examples where attention is mostly diagonal (typically, these correspond to undergenerations). We show examples of such attention maps in Appendix E. Overall, the results highlight a disparity between what is believed to indicate hallucinations and what actually indicates them.

Note that while Attn-ign-SRC performs relatively well, it should be used with caution: attention-based heuristics rely on the assumption that attention patterns reflect model reasoning. This assumption is not reliable: although there is evidence that attention can play recognizable roles (Voita et al., 2018, 2019), a lot of work questions attention explainability (Wiegreffe and Pinter, 2019; Jain and Wallace, 2019; Serrano and Smith, 2019; Bastings and Filippova, 2020; Pruthi et al., 2020). Since the hallucinations identified by Attn-ign-SRC are overwhelmingly contained among the ones identified by Seq-Logprob (Figure 2), we recommend using the latter instead.

Model confidence may be all you need. Figure 2 shows that Seq-Logprob is the best heuristic and performs on par with reference-based COMET. This means that the less confident the model is, the more likely it is to generate an inadequate translation. This agrees with observations made in previous work on quality estimation (Fomicheva et al., 2020). Interestingly, the performance of the method contrasts with its simplicity: Seq-Logprob scores are obtained as a by-product of generating a translation. This distinguishes the method from all the rest, which require additional computation (e.g., corpus-level search for RT or generating multiple hypotheses for MC-DSim).

MC-DSim: hallucinations are mostly unstable. The intuition behind this heuristic is simple: when faced with a source sentence for which a good translation is not immediate, the set of hypotheses the model "keeps at hand" may be very diverse. Indeed, MC-DSim performs relatively well and identifies a good proportion of hallucinations (Figures 1, 2). Thus, most hallucinations are unstable: dissimilarity of MC hypotheses helps to identify them.

6.3 Combining Heuristics and Quality Filters

Intersecting the set of translations obtained via a heuristic with the bottom-scored translations according to a quality filter was introduced in Raunak et al. (2021). The motivation for this was to avoid incorrectly flagging good translations as hallucinations (e.g., without this intersection, RT flags good-quality paraphrases). Intuitively, this idea is very reasonable: hallucinations are indeed incorrect translations. However, implementing it in practice requires a good quality estimation model, specifically for ranking poor translations. Unfortunately, we showed in Section 6.1 that for this purpose, even the state-of-the-art COMET-QE is largely inadequate. This means that using such quality estimates may lead to filtering out a lot of hallucinations, which is not desirable in preventive settings. Our results show that this is exactly the case: e.g., filtering with COMET-QE leads to losing nearly 80% of the hallucinations detected by Seq-Logprob. All in all, such an intersection generally does more harm than good.
Figure 3: Distribution of the translations flagged by each method conditioned on each of the considered pathologies. Panels: (a) binary-score heuristics, (b) continuous-score heuristics, (c) quality filters.

7 Analysing Hallucination Pathologies

In this section, we look at hallucination pathologies in isolation and show that the behavior of detection methods varies depending on the type of pathology. For example, for a given pathology, some methods may be specialised, whereas others may fail. For a similar analysis of other critical translation errors, refer to Appendix F.

Fully detached hallucinations. Figure 3 shows that these hallucinations are easily detected by several methods, e.g. Seq-Logprob, Attn-ign-SRC, COMET, and CHRF2. This is not surprising: intuitively, the most severe pathology should be the easiest to detect. However, COMET-QE fails to identify almost all of these hallucinations. While COMET-based metrics are known not to penalise certain types of errors enough (e.g., discrepancies in numbers and named entities; see Amrhein and Sennrich (2022); Raunak et al. (2022)), such poor performance on completely inadequate translations is highly unexpected. This calls for further research on the behavior of quality estimation models.

On a more general note, let us recall that previous work suggested that fully detached hallucinations emerge as exact copies of references from the training data (Raunak et al., 2021). We validate this hypothesis and find the contrary: out of the 44 unique translations marked as fully detached from the source, only 4 are exact copies of references in the training data. Nevertheless, when looking at these sentences more closely, we see that they do contain large substrings that are seen frequently during training. Therefore, fully detached hallucinations are indeed likely to be traced back to the training data, but they emerge in non-trivial ways and not necessarily as exact copies. This can be seen as one more piece of evidence that, when dealing with memorisation in language models, it is necessary to consider not just full copies in the training data but also near-duplicates (Lee et al., 2022).

Strongly detached hallucinations. As expected, this pathology is harder to detect than fully detached hallucinations (Figure 3). However, the trends are largely similar: for example, COMET-QE fails again, and Seq-Logprob performs best, outperforming even reference-based COMET.

Oscillatory hallucinations. The method specifically developed to detect this hallucination type (Top n-gram count, TNG) performs worse than most of the other methods. Among the rest, COMET performs best. Interestingly, in contrast to previous observations, COMET-QE performs well here, being on par with COMET.

8 DeHallucinator: Overwriting Hallucinations at Test Time

In previous sections, we saw that hallucinations are more unstable than other translations: for them, the generated MC-dropout hypotheses tend to differ a lot. This motivates us to look more closely into these hypotheses: are any of them not hallucinations? Answering this question not only gives insight into the inner workings of hallucinating NMT models, but also leads to an interesting practical application – overwriting hallucinations at inference time. This is of utmost importance for production systems, where hallucinations have a deeply compromising effect on user trust.
Figure 4: Our pipeline scheme along with results.

Whenever flagged, overwrite with better. Intuitively, our idea is similar to hybrid pipelines in which a machine-generated translation is first passed to a quality estimation system and then, if needed, corrected by human translators. In our case, we first apply a hallucination detector and then, if a translation is flagged, we try to overwrite it with a better translation (Figure 4). For this, we generate several MC-dropout hypotheses, score them with some measure, and pick the highest-scoring translation as the final candidate, in the spirit of reranking approaches (Shen et al., 2004; Lee et al., 2021; Fernandes et al., 2022; Freitag et al., 2022).
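A minimal sketch of this pipeline, with seq_logprob, translate_with_dropout and score (e.g., a COMET-QE call) standing in for the components discussed above; keeping the original translation among the candidates is our choice for this illustration.

    def dehallucinate(source, translation, seq_logprob, translate_with_dropout,
                      score, threshold, n=10):
        """If `translation` is flagged by Seq-Logprob, replace it with the
        best-scoring of n MC-dropout hypotheses; otherwise keep it."""
        if seq_logprob(source, translation) >= threshold:
            return translation  # not flagged: keep the original translation
        candidates = [translation] + [translate_with_dropout(source) for _ in range(n)]
        return max(candidates, key=lambda hyp: score(source, hyp))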
The general pipeline above relies on the choice of a hallucination detector and a scoring measure. For the detector, we use the best of the analysed detectors, i.e. Seq-Logprob.[6] For the scoring measure, a natural choice would be a quality estimation system: by construction, these systems are designed to score translations according to quality. However, as we saw earlier, even the state-of-the-art COMET-QE may fail (Section 6.1). Therefore, we compare two measures: COMET-QE and Seq-Logprob.

[6] In this experiment, we take the translations from our dataset and consider the percentiles defined in Section 5.

In this experiment, we randomly choose 200 translations from our dataset flagged by Seq-Logprob. For each, we generate 10 MC-dropout hypotheses. Then, for the overwritten translations, we gather annotations according to our guidelines (Section 5). The results are summarized in Figure 4. Although we were concerned about COMET-QE because of its low performance when ranking poor translations, we find that it is indeed appropriate for choosing the best hypothesis and performs better than Seq-Logprob. We thus show results with COMET-QE scores in the main text and with Seq-Logprob in Appendix G.

Most hallucinations and errors become correct. Figure 4 shows that most hallucinations are overwritten with correct translations. This is surprising: in most cases, the model is not stuck in a hallucinatory mode and can generate good translations in a small vicinity of the model parameters. In this sense, most hallucinations result from "bad luck" during generation rather than from a profound model defect. Note that fully detached hallucinations are the hardest to improve. This makes sense, as these are likely to be traced back to (near-)duplicates in the training data and, therefore, they do highlight model anomalies.

Note that in this pipeline, we overwrite not only hallucinations but also other errors and correct translations that were flagged by the detector. This means that our method needs to handle such translations appropriately. From Figure 4 we see a pleasant side-effect: our approach overwrites most of the errors with correct translations. Just as importantly, almost all originally correct translations remain correct. Overall, the proportion of correct translations among the ones flagged by the detector increases from 33% to 85%, and the hallucinatory rate decreases threefold. In Appendix H, we show several examples of overwritten hallucinations.

9 Conclusions

Dealing with hallucinations is difficult. First, we had to take a step back from previous work and reject procedures that amplify the problem, as these obscure the behavior of models in their standard settings. After that, we noticed that work on detection often relies on assumptions that remained unquestioned (e.g., that generic quality measures, targeted heuristics, or anomalous attention make suitable detectors). Through extensive experiments, we establish order among detection methods. Surprisingly, despite the introduction of several methods specifically targeted at hallucinations, what works best has always been at our disposal: standard sequence log-probability. This suggests that characteristics innate to a model can have a lot of value. In fact, such characteristics are the backbone of our DeHallucinator, a lightweight approach that significantly alleviates hallucinations at test time. This leaves space for future research on model uncertainty, hallucination prevention, and understanding where hallucinations come from, among other directions. For this, we release our corpus with structured annotations along with the model and its training data. Altogether, this allows us to say that we provide solid ground for the future study of hallucinations in NMT.
Limitations

We highlight three main limitations of our work. First, although our proposed taxonomy for hallucinations rests on the idea of detachment from the source content, we do not evaluate this detachment quantitatively. Indeed, we cannot say whether the model that generated the hallucinations was actually detached from the source sentence when generating them. Nevertheless, we can guarantee that the hallucinations in our dataset are translations that are detached from the source content according to professional translators. We consider a quantitative analysis of detachment to be out of the scope of this paper. Still, it constitutes an interesting line of future research on understanding hallucinations in NMT that may be facilitated by the release of our code, model and annotated dataset.

Second, while this paper comprehensively studies the phenomenon of hallucinations in NMT for a high-resource language pair, experiments on more language pairs (including low-resource languages) are necessary to assess the broad validity of our claims. To keep our setup familiar to researchers and practitioners, we opted for a familiar language pair for which data is widely available. Moreover, our choice also facilitated the data collection process, as there is a large supply of professional translators for this language pair.

Lastly, instead of focusing on more recent NMT models that use large pretrained language models as their backbone, we focused on a Transformer base model. The reason for this choice is that we wanted to keep the setup simple, familiar, easy to reproduce, and computationally economical. Moreover, it was important for our work to have full control over the training and held-out data. Nevertheless, research on hallucinations in more recent and powerful NMT models is an exciting line of future work, and we hope our work spurs that research.

All code and annotated data samples will be publicly released following deanonymization.

References

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Chantal Amrhein and Rico Sennrich. 2022. Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Jasmijn Bastings and Katja Filippova. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 149–155, Online. Association for Computational Linguistics.

Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe's systems for the WMT19 machine translation robustness task. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 526–532, Florence, Italy. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale.

Patrick Fernandes, António Farinhas, Ricardo Rei, José De Souza, Perez Ogayo, Graham Neubig, and Andre Martins. 2022. Quality-aware decoding for neural machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1396–1412, Seattle, United States. Association for Computational Linguistics.
Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555.

Markus Freitag, David Grangier, Qijun Tan, and Bowen Liang. 2022. High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics. Transactions of the Association for Computational Linguistics, 10:811–825.

Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York, USA. PMLR.

Junjie Hu, Hiroaki Hayashi, Kyunghyun Cho, and Graham Neubig. 2022. DEEP: DEnoising entity pre-training for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1753–1766, Dublin, Ireland. Association for Computational Linguistics.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation.

Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. CoRR, abs/1903.00802.

Ann Lee, Michael Auli, and Marc'Aurelio Ranzato. 2021. Discriminative reranking for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7250–7264, Online. Association for Computational Linguistics.

Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Panpan Li, Mengxiang Wang, and Jian Wang. 2021. Named entity translation method based on machine translation lexicon. Neural Computing and Applications, 33(9):3977–3985.

Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.

Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020. Results of the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 688–725, Online. Association for Computational Linguistics.

Mathias Müller, Annette Rios, and Rico Sennrich. 2020. Domain robustness in neural machine translation. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 151–164, Virtual. Association for Machine Translation in the Americas.

Mathias Müller and Rico Sennrich. 2021. Understanding the properties of minimum Bayes risk decoding in neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 259–272, Online. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. 2020. Learning to deceive with attention-based explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4782–4793, Online. Association for Computational Linguistics.

Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón, and Sergio Ortiz-Rojas. 2020. Bifixer and Bicleaner: Two open-source tools to clean your parallel data. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 291–298, Lisboa, Portugal. European Association for Machine Translation.

Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.

Vikas Raunak, Matt Post, and Arul Menezes. 2022. SALTED: A framework for salient long-tail translation error detection.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020a. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020b. Unbabel's participation in the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics.

Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez-Sánchez. 2018. Prompsit's submission to WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.

Libin Shen, Anoop Sarkar, and Franz Josef Och. 2004. Discriminative reranking for machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 177–184, Boston, Massachusetts, USA. Association for Computational Linguistics.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.
Katsuhito Sudoh, Kosuke Takahashi, and Satoshi Nakamura. 2021. Is this translation error critical?: Classification-based human and automatic machine translation evaluation focusing on critical errors. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 46–55, Online. Association for Computational Linguistics.

Kosuke Takahashi, Yoichi Ishibashi, Katsuhito Sudoh, and Satoshi Nakamura. 2021. Multilingual machine translation evaluation metrics fine-tuned on pseudo-negative examples for WMT 2021 metrics task. In Proceedings of the Sixth Conference on Machine Translation, pages 1049–1052, Online. Association for Computational Linguistics.

Chau Tran, Shruti Bhosale, James Cross, Philipp Koehn, Sergey Edunov, and Angela Fan. 2021. Facebook AI's WMT21 news translation task submission. In Proceedings of the Sixth Conference on Machine Translation, pages 205–215, Online. Association for Computational Linguistics.

Arata Ugawa, Akihiro Tamura, Takashi Ninomiya, Hiroya Takamura, and Manabu Okumura. 2018. Neural machine translation incorporating named entity. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3240–3250, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh Tomar, and Manaal Faruqui. 2019. Attention interpretability across NLP tasks.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Elena Voita, Rico Sennrich, and Ivan Titov. 2021. Analyzing the source and target contributions to predictions in neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1126–1140, Online. Association for Computational Linguistics.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Chaojun Wang and Rico Sennrich. 2020. On exposure bias, hallucination and domain shift in neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3544–3552, Online. Association for Computational Linguistics.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Jianhao Yan, Fandong Meng, and Jie Zhou. 2022. Probing causes of hallucinations in neural machine translations.

Chrysoula Zerva, Daan van Stigt, Ricardo Rei, Ana C Farinha, Pedro Ramos, José G. C. de Souza, Taisiya Glushkova, Miguel Vera, Fabio Kepler, and André F. T. Martins. 2021. IST-Unbabel 2021 submission for the quality estimation shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 961–972, Online. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, Francisco Guzmán, Luke Zettlemoyer, and Marjan Ghazvininejad. 2021. Detecting hallucinated content in conditional neural sequence generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1393–1404, Online. Association for Computational Linguistics.

A Experimental setup

Data preprocessing. We filter our data using language identification and the simple length heuristics described in Tran et al. (2021). We encode the data with byte-pair encoding (Sennrich et al., 2016) using the SentencePiece framework (Kudo and Richardson, 2018). We set the vocabulary size to 32k and compute a joint encoding and vocabulary.
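
For concreteness, here is a minimal sketch of this step with the SentencePiece Python bindings; the file names are hypothetical and the exact options we used may differ.

```python
import sentencepiece as spm

# Hypothetical file: the concatenated source and target sides of the
# filtered training data, used to learn a *joint* 32k BPE vocabulary.
spm.SentencePieceTrainer.train(
    input="train.joint.txt",
    model_prefix="bpe.joint",
    vocab_size=32000,
    model_type="bpe",
)

# Apply the learned encoding to a sentence.
sp = spm.SentencePieceProcessor(model_file="bpe.joint.model")
print(sp.encode("Hello world!", out_type=str))
```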

Model parameters. We follow the setup of the Transformer base model (Vaswani et al., 2017): hidden size of 512, feedforward size of 2048, 6 encoder and 6 decoder layers, and 8 attention heads. The model has approximately 77M parameters.

Optimizer. Similarly to Vashishth et al. (2019), we train our model using the Adam optimizer with β1 = 0.9 and β2 = 0.98, and use an inverse square root learning rate scheduler with initial value 5 × 10⁻⁴ and a linear warm-up over the first 4000 steps.
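
As an illustration, the schedule described above can be written as a step-to-learning-rate function; this is a sketch of the standard inverse-square-root rule, not code taken from our training setup.

```python
def inverse_sqrt_lr(step: int, peak_lr: float = 5e-4, warmup: int = 4000) -> float:
    """Linear warm-up to peak_lr over `warmup` steps, then inverse sqrt decay."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5
```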
gies in different translations. We selected the two 1123
Training and Inference. Models are trained for 250K updates with a batch size of about 32K tokens. We set dropout to 0.3. At inference time, we produce translations using beam search with a beam size of 5. We validate our models during training using SacreBLEU (Post, 2018), and we choose the checkpoint with the best BLEU on the validation set. We provide BLEU7 and COMET baselines on WMT evaluation campaigns in Table 3. We train and perform inference on top of the Fairseq framework (Ott et al., 2019).

Metric    WMT2014    WMT2017    WMT2018
BLEU      31.1       32.6       38.9
COMET     0.3178     0.3257     0.3340

Table 3: Evaluation metrics for EN→DE on the newstest sets of the WMT 2014, 2017 and 2018 campaigns.
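
As a pointer, validation scoring of this kind can be reproduced with the sacrebleu Python API (whose defaults correspond to the signature in footnote 7); the strings below are made-up examples, not our evaluation script.

```python
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["The cat sat on the mat."]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```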
lines used by the translators along with all other 1134
COMET versions. We use models available in the official repository8: wmt20-comet-da for COMET and wmt20-comet-qe-da-v2 for COMET-QE.
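
For reference, a minimal sketch of scoring with these checkpoints via the unbabel-comet package is given below; the exact API surface depends on the installed version, so treat this as illustrative rather than definitive.

```python
from comet import download_model, load_from_checkpoint

# Download and load one of the checkpoints named above.
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

# For COMET-QE (wmt20-comet-qe-da-v2), omit the "ref" field.
data = [{"src": "Der Hund bellt.", "mt": "The dog barks.", "ref": "The dog is barking."}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)
```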
deem that translation as correct (COR) or incorrect. 1138
TokHal-Model. We follow the official implementation.9 For the synthetic data generation step, we used BART-large; for the token-level hallucination predictor, we used XLM-R (Conneau et al., 2019).
rors such as named-entity mistranslations (NE) and 1144
7 BLEU+case.mixed+lang.XX+numrefs.1+smooth.exp+tok.13a+version.1.4.2
8 https://github.com/Unbabel/COMET
9 https://github.com/violet-zct/fairseq-detect-hallucination

B Data Collection

We perform a rigorous and comprehensive manual annotation procedure with professional translators, in order to make sure that we are reliably analysing hallucinated translations. First, the translators were asked to familiarise themselves with our task by reading the annotation examples we provided, along with detailed annotation instructions. Then they had to pass a test to show that they could recognise different pathologies in different translations. We selected the two best translators to gather the annotations. After hiring them, we ran three one-hour tutorial sessions to explain the task thoroughly and to clarify any possible questions. During the annotation process, we made sure to be promptly available to answer any question from the translators. We paid a fair wage (25–30 USD per hour) – well above the US federal minimum – and inspected their work for quality.

Guidelines. We make available the full guidelines used by the translators, along with all other resources, in the project repository. In short, the annotators were presented with a source sentence and a model-generated hypothesis, and asked to deem the translation correct (COR) or incorrect. If incorrect, they were prompted to answer a series of yes/no questions regarding the presence of specific hallucinatory pathologies: oscillations (OSC), strong detachment (SD) and full detachment (FD). We also asked annotators to flag critical errors such as named-entity mistranslations (NE) and under-generated translations (UG).

Inter-annotator agreement. To determine the reliability of our annotations, we asked both translators to annotate a set of 400 randomly sampled translations. For all hallucinatory categories but SD, the annotators achieved – according to Cohen's kappa coefficient (Cohen, 1960) – almost perfect agreement. For all other categories, moderate to substantial agreement was obtained. This confirms that our data conforms very well to our instructions. The agreement scores for each category are displayed in Table 4.

COR     UG      NE      OSC     SD      FD
0.62    0.73    0.42    0.86    0.45    0.89

Table 4: Fleiss's Kappa inter-annotator agreement scores (↑) for the different categories translators were prompted to identify.
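
As a pointer, agreement scores of this kind can be computed with scikit-learn; the labels below are toy values, not our annotation data.

```python
from sklearn.metrics import cohen_kappa_score

# Toy binary labels for one category (e.g. FD) from the two annotators
# over the same shared sample of translations.
annotator_a = [1, 0, 0, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0]

print(cohen_kappa_score(annotator_a, annotator_b))
```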

C An Analysis of the Dataset

C.1 High-Level Overview of the Dataset

Figure 1 gives a structured overview of dataset statistics. First, we see that although we picked translations that are likely to be pathological, 60% of the dataset consists of correct translations. This highlights that, with the existing methods, reliably finding poor translations is still challenging. Next, note that most of the incorrect translations contain errors that are not severe enough to be deemed hallucinations. This agrees with the view that hallucinations lie at the extreme end of the spectrum of MT pathologies (Raunak et al., 2021). Finally, the results of the annotation confirm that our data selection is very reasonable. Indeed, while previous work has reported hallucinatory rates of 0.2–2% in in-domain settings, we see that, for a reasonably sized collection of examples, our hallucination rate is substantially higher (9%). In Figure 1, we also show the method-specific statistics of the human annotation results for each heuristic and quality filter. Unsurprisingly, the long tails of each method display different characteristics. For example, almost all translations flagged by Attn-to-EOS are correct, whereas the proportion of good translations flagged by COMET-QE or Seq-Logprob is rather small. We analyse this further in Section 6. We also provide additional insights on how hallucinations in our data exhibit different characteristics from other, less severe MT errors in Appendix C.2.

C.2 Hallucinations vs MT Errors

We look separately at the sets of examples with hallucinations and those with other translation errors. Figure 5 shows the structure of these sets with respect to the interactions of the different criteria.

We see that it is reasonable to consider hallucinations separately from other errors: the patterns in Figures 5a and 5b are substantially different. For example, for translation errors COMET-QE performs well, being on par with reference-aware COMET (Figure 5b). However, for hallucinations, it identifies less than half the number of examples identified not only by COMET, but even by simple uncertainty-based heuristics (Figure 5a). Furthermore, Figure 5 reveals a significant difference between the interactions of the different criteria for hallucinations and for other translations: hallucinations in our data are most often flagged by multiple criteria simultaneously (e.g. Seq-Logprob flags the majority of hallucinations detected with Attn-ign-SRC and MC-DSim), whereas most MT errors and correct translations are flagged by a single method. This difference in patterns supports our choice of taxonomy: the properties of hallucinations differ greatly from those of other, less severe errors.

D Analysis of TokHal-Model

TokHal-Model is unfit for natural hallucinations. Let us recall that the model used for the TokHal-Model scores was trained to identify replaced tokens in corrupted translations that are fluent and do not differ much from the original ones (Zhou et al., 2021). This means that, during training, the model was unlikely to observe highly pathological translations that reflect the types of hallucinations produced by actual NMT systems. This raises several concerns when using this model in our setting. For example, it might incorrectly flag adequate tokens such as synonyms or paraphrases. What is more, since severely flawed examples are mostly out of distribution for the model, the labels predicted for such translations may be unreasonable.

The results confirm our concerns: Figures 1 and 5a show that (i) the vast majority of translations flagged by TokHal-Model are correct and (ii) it is one of the worst hallucination detectors. On a separate note, in Section 7 we will see that it is reasonable for identifying mistranslations of named entities, which supports our hypothesis that this method is more likely to detect less severe translation errors rather than hallucinations.

E Patterns of attention maps for translations flagged with attention-based heuristics

The attention maps are shown in Figure 6. While both Attn-to-EOS and Attn-ign-SRC were designed to identify translations for which almost all the attention mass is concentrated on the EOS token, the patterns identified with Attn-ign-SRC are more diverse. They range from attention mass going to various uninformative tokens (e.g., punctuation and other tokens, as in Figure 6b) to the examples shown in Figure 6c, where attention is mostly diagonal (typically, these correspond to undergenerations; we continue this discussion in the next section).

(a) Hallucinations  (b) MT errors  (c) Correct translations

Figure 5: Structure of the sets of translations flagged by the considered methods. Horizontal bars show the proportion of examples flagged by each method among all translations of the considered category. Each vertical bar shows the size of the set of translations that are (i) flagged by all the methods marked in the corresponding column and (ii) not flagged by any of the rest; only the top intersections are shown. Quality filters are shown with diamond marks, and detection heuristics with circles. Methods requiring reference translations are shaded.

(a) Correct translation, flagged by Attn-to-EOS  (b) Hallucination, flagged by Attn-ign-SRC  (c) Undergeneration, flagged by Attn-ign-SRC

Figure 6: Examples of attention maps flagged by attention-based heuristics.

F Analysing Less Severe Translation Errors

We notice that some of the detection methods are specialised in specific pathologies. For example, Attn-ign-SRC is by far the best at detecting undergenerations. This is expected: an undergeneration does not cover part of the source sentence, so a significant proportion of source tokens receives little attention mass (see the example in Figure 6c). For named-entity errors, the best heuristic is TokHal-Model. This mirrors the discussion in Section D: while severe errors (i.e., hallucinations) fall out of distribution for this model, mistranslations of short phrases are more in line with the model's training.

On a different note, MC-DSim performs much better for hallucinations than for less severe pathologies (Figure 3b). This again points to hallucinations being more unstable than other errors.

G Overwriting Hallucinations with Seq-Logprob as the scoring measure

The results are shown in Figure 7. In order to use Seq-Logprob as the scoring measure, we score all generated hypotheses with the original model. Overall, although the results follow the same trend as those using COMET-QE as the scoring measure, they are slightly worse: the hallucinatory rate is higher and the percentage of correct translations is smaller.

Figure 7: Our pipeline scheme along with results when we use Seq-Logprob as both the detector and the scoring measure.
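
To make the scoring step concrete, here is a sketch of how a length-normalised sequence log-probability can be computed from per-token log-probabilities; this is a generic illustration, not our exact Fairseq-based implementation.

```python
import math

def seq_logprob(token_logprobs: list[float]) -> float:
    """Length-normalised sequence log-probability of a generated hypothesis.

    `token_logprobs` holds log p(y_t | y_<t, x) for each generated token,
    e.g. as returned by the decoding framework that produced the hypothesis.
    """
    return sum(token_logprobs) / len(token_logprobs)

# Toy example: a confident hypothesis scores higher than a doubtful one.
confident = [math.log(0.9)] * 5
doubtful = [math.log(0.2)] * 5
assert seq_logprob(confident) > seq_logprob(doubtful)
```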

H Examples of overwritten translations

Table 5 shows examples of each hallucination type that have been overwritten with correct translations using the approach described in Section 8.

I Practical Recommendations for Detection

As we have mentioned in Section 3.4, binary-label heuristics output a value in {0, 1}, and continuous-score heuristics output a value s ∈ R. These values can be used to build a binary decision rule: for a given source x, translation ŷ and reference y, a translation is flagged by the detector if and only if s(x, ŷ, y) ≤ γ. Naturally, s need not be a function of all of x, ŷ and y (e.g., COMET-QE is only a function of x and ŷ).

Choosing the thresholds. In our work, we chose the threshold γ for each detector by taking the value of s corresponding to approximately the 0.4-th percentile, to be consistent with the data selection process (see Section 5). We computed these thresholds using the entire filtered held-out data. However, in practice, we obtained very similar results when we computed these thresholds using a collection of only 10,000 examples from that dataset. We recommend computing these thresholds on clean in-domain data. This guarantees that the cut-off values are obtained in a scenario where the model performs best. Finally, the choice of the k-th percentile is expected to have an impact on the precision-recall trade-off. Thus, for preventive settings, we recommend sticking to more conservative values of k.
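
A sketch of this decision rule and of the percentile-based threshold selection is given below, assuming detector scores have already been computed for a clean in-domain held-out set; the scores here are random stand-ins.

```python
import numpy as np

def choose_threshold(clean_scores: np.ndarray, k: float = 0.4) -> float:
    """Pick gamma as the k-th percentile of detector scores on clean data."""
    return float(np.percentile(clean_scores, k))

def flag(scores: np.ndarray, gamma: float) -> np.ndarray:
    """A translation is flagged if and only if its score s <= gamma."""
    return scores <= gamma

# Toy example with random stand-in scores (e.g. Seq-Logprob values).
rng = np.random.default_rng(0)
clean_scores = rng.normal(loc=-0.5, scale=0.2, size=10_000)
gamma = choose_threshold(clean_scores, k=0.4)
flags = flag(clean_scores, gamma)
print(gamma, flags.mean())  # roughly 0.4% of translations are flagged
```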

Choosing the detectors. The choice of detector rests upon the application for which it is intended. For high-precision settings, binary-label heuristics such as TNG and variants thereof (Raunak et al., 2022) may be preferable. For preventive settings, we suggest using Seq-Logprob as the backbone of the hallucination detector. Naturally, we generally obtain higher recall by joining the sets of translations flagged by multiple methods (see the sketch at the end of this section). For example, Figure 3 reveals that joining the set of translations flagged by Seq-Logprob with the set of translations flagged by COMET-QE is very reasonable: COMET-QE performs better for oscillatory hallucinations, while Seq-Logprob is better for the other hallucination types.

Be careful when relying on references. Using reference information is helpful for detecting hallucinations (see Section 6.1), and while it may not be usable to detect hallucinations on-the-fly, it may still prove useful for analysis work. We have found that high-quality parallel data is critical for the adequate application of these methods: very low scores might be attributed not only to poor translations, but also to reference mismatches. Indeed, preliminary experiments highlighted this worrying trend, which motivated us to clean the held-out set (Section 4). Thus, if using reference information to detect hallucinations, make sure to thoroughly clean your parallel data.
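
As a final illustration of combining detectors, here is a sketch of the union rule mentioned above, assuming boolean flag arrays from two detectors computed as in the previous sketch; the values are hypothetical.

```python
import numpy as np

# Hypothetical flag arrays over the same translations, one per detector.
flagged_by_seq_logprob = np.array([True, False, True, False])
flagged_by_comet_qe = np.array([False, False, True, True])

# Union of the flagged sets: higher recall, at some cost in precision.
flagged_combined = flagged_by_seq_logprob | flagged_by_comet_qe
print(flagged_combined)  # [ True False  True  True]
```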

OVERWRITING FULLY DETACHED HALLUCINATIONS

Source: Handys, die bis auf Wäschewaschen und Staubsaugen scheinbar alles können.
Reference: Mobile phones that can practically do everything except clean the laundry and vacuum clean.
Original hypothesis: The staff were very friendly and helpful.
Overwritten hypothesis: Mobile phones that seem to be able to do everything except on laundry and dustproofing.

Source: In unserem 2 Personen Van mit Dusche/WC war ausreichend Platz für uns beide.
Reference: The space in our 2 person van with shower/toilet was enough for 2 people.
Original hypothesis: The staff were very friendly and helpful. The room was clean and comfortable.
Overwritten hypothesis: In our 2 person van with shower/WC there was enough room for us both.

OVERWRITING STRONGLY DETACHED HALLUCINATIONS

Source: Tickets für Busse und die U-Bahn ist zu teuer, vor allem in Stockholm.
Reference: Tickets for buses and the subway is too expensive, especially in Stockholm.
Original hypothesis: The hotel is located in the centre of Stockholm, close to the train station.
Overwritten hypothesis: Buses and metro tickets are too expensive, especially in Stockholm.

Source: Ich freue mich über jeden Tag, wo es mir so gut geht!
Reference: I am pleased about each day, where I am so well!
Original hypothesis: I look forward to seeing you every day!
Overwritten hypothesis: I'm very happy with every day I'm doing so well!

OVERWRITING OSCILLATORY HALLUCINATIONS

Source: In dieser Zeit stürzt sich Murnau bereits wieder in Theaterproben, allerdings widmet er sich nicht mehr der Schauspielerei, sondern der Regie.
Reference: During this period, Murnau once again dedicates his time to theatre rehearsals; however, this time not as an actor, but as a director.
Original hypothesis: Murnau was born in Murnau, Germany. He was born in Murnau, Germany.
Overwritten hypothesis: During this time, Murnau began to appear in theater tests again, but he was no longer concerned with acting, but with directing.

Source: Die Staaten, deren Fangflotten im Nordwestatlantik Hochseefischerei betreiben, bemühen sich im Rahmen der NAFO (Organisation für die Fischerei im Nordwestatlantik) um eine gemeinsame Bestandserhaltung und -bewirtschaftung.
Reference: States which fish in the high seas in the North West Atlantic co-operate in NAFO (North-west Atlantic Fisheries Organisation) in order to ensure conservation and management of stocks.
Original hypothesis: The North-West Atlantic Fisheries Organisation (NAFO) is a member of the North-West Atlantic Fisheries Organisation (NAFO).
Overwritten hypothesis: The states whose fishing fleets in the North-West Atlantic are engaged in deep-sea fishing are seeking joint conservation and management within the framework of the North-West Atlantic Fisheries Organisation (NAFO).

Table 5: Examples of hallucinations of each type that have been overwritten with correct translations.
