Table 2: Our synthetically-noisy data have similar CER to the real OCR outputs, which implies that real OCRed noisy data can be mimicked by our noise model.

Table 2 shows the CER comparison between our synthetic data and real OCR data.

4.3 Test Set and Gold Alignment

The test set and gold alignment for English-French both come from Mihalcea and Pedersen (2003). For English-German, the test set comes from the Europarl v7 corpus (Koehn, 2005) and the gold alignments are provided by Vilar et al. (2006). To study the effect of OCR-like errors on alignment quality, we create synthetically-noised test sets for both language pairs by applying noise to one side or both, which results in four copies of the same test set: clean-clean, clean-noisy, noisy-clean, and noisy-noisy.

For the low-resource language pairs, Rijhwani et al. (2020) provide about 800 parallel sentence pairs for each. We use the first 300 sentence pairs as our test sets. For fair evaluation of our method, we manually create a total of 4,101 gold word-level alignment pairs for the Griko-Italian test set. For Ainu-Japanese, on the other hand, we obtain silver alignments from awesome-align, as there is no existing gold alignment available for this language pair.9

9 While not ideal, we can still measure how different the results are when comparing alignments on clean versus noisy data.

5 Experiments

In this section we present multiple experiments and show that our method significantly reduces AER.

5.1 Experimental Setup

Models We study the following alignment models:

• IBM Models 1 & 2 (Brown et al., 1993): the classic statistical word alignment models, which underpin many other statistical machine translation and word alignment models.

• Giza++ (Och, 2003): a popular statistical alignment model based on a pipeline of IBM and Hidden Markov models (Vogel et al., 1996).

• fast-align (Dyer et al., 2013): a simple but effective statistical word alignment model based on IBM Model 2, with an additional bias towards monotone alignment.

• awesome-align (Dou and Neubig, 2021): a neural word alignment model based on mBERT. It fine-tunes a pre-trained multilingual language model on parallel text and extracts alignments from the resulting representations.

5.2 The Effect of OCR-like Noise

Table 3: AER comparison for Griko-Italian. Giza++ performs best in both settings, but it exhibits the largest drop in performance.

We use Griko-Italian as our main evaluation pair due to the presence of gold alignments, which can most accurately reflect a model's performance under a low-resource scenario.

We first benchmark model performance on clean and OCRed parallel text to quantify the effect of OCR errors on alignment (Table 3). We compute AER for the clean and OCRed versions of Griko-Italian by comparing each model's alignments against our manually created gold alignment. We benchmark five different models, leading to several observations. First, clean text always results in better alignment for all models. Overall, Giza++ performs best among the models, but note that it also suffers the largest drop in performance when faced with noisy text. On the other hand, a vanilla awesome-align, which is otherwise a state-of-the-art model for languages included in the pre-training of its underlying model, performs the worst, no better than a simple IBM Model 1.

We can thus conclude that OCR errors do impact alignment quality for both statistical and neural alignment models.

It is of note that for Griko-Italian, every statistical model outperforms awesome-align in almost all cases. We hypothesize that this is due to the
                 Griko-Italian          Ainu-Japanese
                 Clean      OCRed       Clean      OCRed
BASE             45.1       48.8        28.2       29.2
UNSUP-FT (A)     23.2±1.6   28.0±1.1    21.1±1.0   22.2±1.2
SUP-FT (B)       22.3±2.0   26.6±1.1    30.9±2.1   31.5±1.2
+ diagonal bias
UNSUP-FT (A)     18.7±1.0   24.2±0.6    15.3±3.9   13.8±4.2
SUP-FT (B)       18.2±2.6   22.9±2.4    26.3±2.1   27.4±1.8
AER reduction    59.6%      53.1%       45.7%      52.7%

Table 4: For both endangered languages, our approach greatly improves both clean and OCRed alignments.
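The "+ diagonal bias" rows in Table 4 reweight the aligner's scores toward monotone, near-diagonal alignments. One simple way to apply such a structural bias to a source-by-target similarity matrix is sketched below; the exact prior used in our implementation may differ, and `diagonal_bias`, its `strength` parameter, and the toy matrix are illustrative only:

```python
import math

def diagonal_bias(sim, strength=1.0):
    """Rescale a src-by-tgt similarity matrix toward the diagonal:
    multiply each score by exp(-strength * |i/m - j/n|), so that
    word pairs at similar relative positions are favored."""
    m, n = len(sim), len(sim[0])
    return [
        [sim[i][j] * math.exp(-strength * abs(i / m - j / n)) for j in range(n)]
        for i in range(m)
    ]

# Toy 2x2 similarity matrix: the off-diagonal score is damped.
sim = [[0.4, 0.6], [0.5, 0.5]]
biased = diagonal_bias(sim, strength=2.0)
```

Because the factor is 1 on the exact diagonal and decays with distance from it, the bias only reorders competing links, never inflates scores.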
Table 5: Results of awesome-align on English-French and English-German alignment. Both unsupervised and supervised finetuning with noise-induced data lead to large AER improvements when aligning noisy data. Improvements on German are less pronounced. Unsupervised finetuning with noisy data also improves clean-data alignment.
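AER, the metric reported in Tables 4 and 5, is the standard alignment error rate of Och and Ney (2003), computed from predicted links and gold sure/possible link sets. A minimal sketch (the function name and toy alignments are ours):

```python
def aer(pred, sure, possible):
    """Alignment Error Rate (Och and Ney, 2003).

    pred, sure, possible: sets of (src_idx, tgt_idx) link pairs,
    with sure a subset of possible. Lower is better.
    """
    pred, sure, possible = set(pred), set(sure), set(possible)
    if not pred and not sure:
        return 0.0
    return 1.0 - (len(pred & sure) + len(pred & possible)) / (len(pred) + len(sure))

# Toy example: two predicted links; one hits a sure link, one a possible link.
gold_sure = {(0, 0), (1, 1)}
gold_poss = gold_sure | {(2, 1)}
predicted = {(0, 0), (2, 1)}
print(round(aer(predicted, gold_sure, gold_poss), 3))  # → 0.25
```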
Table 6: Experiments on Griko-Italian: every statistical model benefits from training with additional clean data but suffers significant performance drops with synthetic noisy data, suggesting that traditional statistical models rely on clean text.
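The synthetic noisy data referenced above can be illustrated with simple character-level edits applied at a target CER. This uniform-edit sketch is only an approximation: our actual noise model is designed to mimic the error distribution of real OCR output, and `noisify`, its defaults, and the sample phrase are illustrative:

```python
import random

def noisify(text, cer=0.05, alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Corrupt roughly a `cer` fraction of characters with OCR-like
    edits: substitutions, deletions, and insertions in equal proportion."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < cer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(rng.choice(alphabet))
            elif op == "ins":
                out.append(ch)
                out.append(rng.choice(alphabet))
            # "del": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)

clean = "i cavalieri che fanno la guardia"
print(noisify(clean, cer=0.1))
```

Seeding makes the corruption reproducible, so a clean/noisy test-set pair stays fixed across experiments.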
alignments by creating several English-French synthetic datasets with varying degrees of CER and testing them on awesome-align. We elaborate on the process and results in Appendix B.2. The main finding is that higher CER leads to greater AER, which is expected. However, we also find that mixing different degrees of CER generally produces better results than a fixed CER throughout the corpus, suggesting that our augmentation approach could also work in the real-world scenario where the CER is unknown.

Statistical Model vs Neural Model The question of which model to use in practical scenarios, though, remains tricky to answer. Due to similarities between Griko and Italian and prolonged language contact over centuries, the two languages follow very similar syntax; as a result, their alignment is largely monotone, which benefits models like Giza++ and fast-align. They outperform, in fact, the vanilla neural awesome-align model by a large margin (see Table 3). However, this will not always be the case. For example, most books with parallel data in the Archive of Indigenous Languages of Latin America (AILLA) contain data between indigenous languages and one of Spanish or English. There, the two sides of the data come from different language families, and a monotone alignment is not necessarily to be expected. In such cases, a more adaptable neural model like awesome-align, aided by our data augmentation and diagonal biasing methods, could indeed be the best option.

Different side of OCR noise An important takeaway from Table 5 is that when both sides of the parallel data are noisy, awesome-align's performance degrades much more than when only one of the two sides is noisy, in both the unsupervised and supervised settings. This is in fact encouraging for our envisioned application scenarios, since, as in the AILLA examples described above, we expect that OCRed parallel data in endangered languages will come with one side in a high-resource standardized language like English or Spanish, which in turn we expect the OCR model to handle adequately.13

[Figure: AER of unsupervised and supervised finetuning on the clean-clean, clean-synth, synth-clean, and synth-synth test sets.]

7 Related Work

Our work is a natural extension of previous word alignment work. A robust alignment tool for low-resource languages benefits MT systems (Xiang et al., 2010a; Levinboim and Chiang, 2015; Beloucif et al., 2016a; Nagata et al., 2020) or speech recognition (Anastasopoulos and Chiang, 2018), especially if sentence-level alignment tools like LASER (Artetxe and Schwenk, 2019; Chaudhary et al., 2019) do not cover all languages, so one may need to fall back to word-level alignment heuristics to inform sentence-alignment models like Hunalign (Varga et al., 2007).

Research on word-level alignment started with statistical models, with the IBM Translation Models (Brown et al., 1993) serving as the foundation for many popular statistical word aligners (Och and Ney, 2000, 2003; Och, 2003; Tiedemann et al., 2016; Vogel et al., 1996; Gao and Vogel, 2008; Dyer et al., 2013). In recent years, different neural network-based alignment models have gained in popularity, including end-to-end (Zenkel et al., 2020; Wu et al., 2022; Chen et al., 2021), MT-based (Chen et al., 2020), and pre-training-based models (Garg et al., 2019; Dou and Neubig, 2021). As awesome-align achieves the overall highest performance, we choose to focus on awesome-align in this work.

Some works improve word-level alignment for low-resource languages by utilizing semantic information (Beloucif et al., 2016b; Pourdamghani et al., 2018), multi-task learning (Levinboim and Chiang, 2015), and combining complementary word alignments (Xiang et al., 2010b). None of the previous work, though, to our knowledge, tackles the problem of aligning data with OCR-like noise on one or both sides. The idea of augmenting training data is not new and has been applied in many areas and applications. Marton et al. (2009) augment data with paraphrases taken from other languages to improve low-resource language alignments. While potentially orthogonal to our approach, this idea is largely inapplicable to our endangered language settings, as we often have to work with the only available datasets for these particular languages. Applying structural alignment bias to statistical and neural models is also a well-studied area (Cohn et al., 2016; Brown et al., 1993; Vogel et al., 1996; Och, 2003; Dyer et al., 2013). However, to the best of our knowledge, we are the first to apply it to low-resource languages, proving that such an approach can greatly aid real endangered language data.

8 Conclusion

In this work, we benchmark several popular word alignment models under OCR-noisy settings with high- and low-resource language pairs, conducting several studies to investigate the relationship between OCR noise and alignment quality. We propose a simple yet effective approach to create realistic OCR-like synthetic data, and we make the state-of-the-art neural awesome-align model more robust by leveraging structural bias. Our work paves the way for future word-level alignment research on underrepresented languages. As part of this paper, we also release a total of 4,101 ground-truth word alignment pairs for Griko-Italian,14 which can be a useful resource for investigating word- and sentence-level alignment techniques in practical endangered language scenarios.

9 Limitations

Using AER as the main evaluation metric could be a limitation of our work, as it might be misleading in some cases (Fraser and Marcu, 2007). Another limitation, of course, is that we only manage to explore the tip of the iceberg given the sheer number of endangered languages. While we are confident in the results on both low-resource language pairs, our experiments on Ainu-Japanese could potentially lead to inaccurate AER since we use the automatically generated silver alignment. In the future, we hope to annotate it with the help of either native speakers or dictionaries. We also plan to explore alternative metrics and expand our alignment benchmark to as many endangered languages as possible.

13 Rijhwani et al. (2020) and Rijhwani et al. (2021) make similar observations on all endangered language datasets they work with.

14 To be made publicly available with our paper.

Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. 2019. Low-resource corpus filtering using multilingual sentence embeddings. WMT 2019, page 263.
Table 8: Comparison of the number of alignment pairs produced by each model on Griko-Italian. awesome-align produces almost 25% fewer alignment pairs, resulting in markedly lower precision/recall and higher AER.
Table 9: AER comparison for varying CER in the 100K English-French augmented data used for either unsupervised or supervised finetuning. We highlight the best result under each setting and test set. Overall, most models' performance is close to the baseline, but varying amounts of noise (mixed) lead to the best results overall. Too high an amount of noise (e.g., CER around 10 with WER approaching 70) hurts the supervised approach.
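CER, as used in Table 9, is the character-level Levenshtein distance between a noisy hypothesis and its clean reference, normalized by reference length. A minimal sketch (the function name is ours):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein edit distance between the
    hypothesis and the reference, normalized by reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / m if m else 0.0

# One substitution (d -> b) over seven reference characters.
print(round(cer("guardia", "guarbia"), 3))  # → 0.143
```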