Transferring BERT Capabilities From High-Resource To Low-Resource Languages Using Vocabulary Matching
… for low-resource languages, thus democratizing access to advanced language understanding models.
  Model             Accuracy   NDCG
1 E5                   80.00  74.79
2 Silver Retriever     84.00  73.59
3 HerBERT              92.00  79.72
4 Matched HerBERT      96.00  81.79

Table 3: The performance of the passage retrieval task for Silesian. HerBERT refers to the HerBERT model fine-tuned on the Silesian corpus and Matched HerBERT refers to the model that uses vocabulary matching (see Section 3.2).

4.2.2. Results

The E5 models score 80% for accuracy and 74.79% for NDCG. Silver Retriever achieves comparable results: it scores higher for accuracy (84%) but lower for NDCG (73.59%). As Table 3 shows, both fine-tuned models outperform the zero-shot retrievers, but the HerBERT model with matched vocabulary achieves much better results than its plainly fine-tuned counterpart: 96% vs 92% for accuracy and 81.79% vs 79.72% for NDCG. This confirms the effectiveness of the vocabulary-matching technique.
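For reference, accuracy here treats a question as answered correctly when the relevant passage appears among the top retrieved results, and NDCG is the normalized discounted cumulative gain of Järvelin and Kekäläinen (2002). The sketch below is a minimal illustration of both metrics, assuming binary relevance, a single relevant passage per question, and a cutoff of 10; the cutoff and function names are assumptions, not taken from the paper.

    import math

    def accuracy_at_k(ranked_ids, relevant_id, k=10):
        # A question counts as answered if the relevant passage is in the top k.
        return float(relevant_id in ranked_ids[:k])

    def ndcg_at_k(ranked_ids, relevant_id, k=10):
        # With binary relevance and a single relevant passage, the ideal DCG
        # is 1, so NDCG reduces to 1 / log2(rank + 1) at the hit position.
        for rank, pid in enumerate(ranked_ids[:k], start=1):
            if pid == relevant_id:
                return 1.0 / math.log2(rank + 1)
        return 0.0

    # Toy example: the relevant passage "p7" is retrieved at rank 2.
    ranking = ["p3", "p7", "p1", "p9"]
    print(accuracy_at_k(ranking, "p7"))  # 1.0
    print(ndcg_at_k(ranking, "p7"))      # 1 / log2(3) ≈ 0.631

Averaging these per-question scores over the test set (times 100) yields percentages on the same scale as Table 3.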
5. Conclusion

In this paper, we propose a simple method to extend the capabilities of pre-trained language models, in particular BERT, to low-resource languages through vocabulary matching. Experiments conducted on the Silesian and Kashubian languages demonstrate the effectiveness of the proposed method on masked word prediction and passage retrieval tasks.

The contributions of this work include the publication of the first BERT models for the Silesian and Kashubian languages, marking a significant advance in linguistic resources for these languages.
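To make the vocabulary-matching idea concrete, below is a minimal sketch of one plausible realization rather than the exact procedure of Section 3.2. It assumes the HuggingFace Transformers API and the public HerBERT checkpoint; the ./silesian-tokenizer path is hypothetical, and initializing unmatched tokens with the mean of their source-subword embeddings is an assumption.

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Source: Polish HerBERT. Target: a tokenizer trained on the
    # low-resource corpus (the path below is illustrative).
    src_tok = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
    tgt_tok = AutoTokenizer.from_pretrained("./silesian-tokenizer")
    model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

    src_emb = model.get_input_embeddings().weight.data
    new_emb = torch.empty(len(tgt_tok), src_emb.size(1))

    src_vocab = src_tok.get_vocab()
    for token, tgt_id in tgt_tok.get_vocab().items():
        if token in src_vocab:
            # Exact match: reuse the source embedding unchanged.
            new_emb[tgt_id] = src_emb[src_vocab[token]]
        else:
            # No exact match: average the embeddings of the source subwords
            # (real code would also handle subword prefix markers).
            pieces = src_tok(token, add_special_tokens=False)["input_ids"]
            new_emb[tgt_id] = (src_emb[pieces].mean(dim=0)
                               if pieces else src_emb.mean(dim=0))

    # Swap in the matched embedding matrix, then fine-tune as usual.
    model.resize_token_embeddings(len(tgt_tok))
    model.get_input_embeddings().weight.data.copy_(new_emb)

Tokens shared with the high-resource vocabulary thus start fine-tuning from their pre-trained representations instead of a random initialization.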
6. Bibliographical References

Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. In International Conference on Learning Representations.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.

Luyu Gao, Xueguang Ma, Jimmy J. Lin, and Jamie Callan. 2022. Tevatron: An efficient and flexible toolkit for dense retrieval.

Luke Gessler and Amir Zeldes. 2022. MicroBERT: Effective training of low-resource monolingual BERTs through parameter reduction and multitask learning. In Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL), pages 86–99, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Viktor Hangya, Hossain Shaikh Saadi, and Alexander Fraser. 2022. Improving low-resource languages in pre-trained multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11993–12006, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20:422–446.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot language transfer with multilingual Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483–4499, Online. Association for Computational Linguistics.

Michał Marcińczuk, Adam Radziszewski, Maciej Piasecki, Dominik Piasecki, and Marcin Ptak. 2013. Evaluation of baseline information retrieval for Polish open-domain question answering system. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 428–435, Hissar, Bulgaria. INCOMA Ltd. Shoumen, BULGARIA.

Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, and Ireneusz Gawlik. 2021. HerBERT: Efficiently pretrained transformer-based language model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 1–10, Kiyv, Ukraine. Association for Computational Linguistics.

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164, Florence, Italy. Association for Computational Linguistics.

Piotr Rybak. 2023. MAUPQA: Massive automatically-created Polish question answering dataset. In Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023), pages 11–16, Dubrovnik, Croatia. Association for Computational Linguistics.

Piotr Rybak and Maciej Ogrodniczuk. 2023. Silver Retriever: Advancing neural passage retrieval for Polish question answering.

Piotr Rybak, Piotr Przybyła, and Maciej Ogrodniczuk. 2022. PolQA: Polish question answering dataset.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.

Ke Tran. 2020. From English to foreign languages: Transferring pre-trained language models.

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022a. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.

Xinyi Wang, Sebastian Ruder, and Graham Neubig. 2022b. Expanding pretrained models to thousands more languages via lexicon-based adaptation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 863–877, Dublin, Ireland. Association for Computational Linguistics.

Zihan Wang, Karthikeyan K, Stephen Mayhew, and Dan Roth. 2020. Extending multilingual BERT to low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2649–2656, Online. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.