Louis Martin∗1,2,3  Benjamin Muller∗2,3  Pedro Javier Ortiz Suárez∗2,3  Yoann Dupont3
Laurent Romary2  Éric Villemonte de la Clergerie2  Djamé Seddah2  Benoît Sagot2

1 Facebook AI Research, Paris, France
2 Inria, Paris, France
3 Sorbonne Université, Paris, France

louismartin@fb.com
{benjamin.muller, pedro.ortiz, laurent.romary, eric.de_la_clergerie, djame.seddah, benoit.sagot}@inria.fr
yoa.dupont@gmail.com
2.1 Contextual Language Models

From non-contextual to contextual word embeddings   The first neural word vector representations were non-contextualized word embeddings, most notably word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Mikolov et al., 2018), which were designed to be used as input to task-specific neural architectures. Contextualized word representations such as ELMo (Peters et al., 2018) and flair (Akbik et al., 2018) improved the representational power of word embeddings by taking context into account. Among other things, they improved the performance of models on many tasks by handling word polysemy. This paved the way for larger contextualized models that replaced downstream architectures altogether in most tasks. Trained with language modeling objectives, these approaches range from LSTM-based architectures such as that of Dai and Le (2015) to the successful Transformer-based architectures such as GPT2 (Radford et al., 2019), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and more.

BERT and RoBERTa   Our approach is based on RoBERTa (Liu et al., 2019), which is itself based on BERT (Devlin et al., 2019). BERT is a multi-layer bidirectional Transformer encoder trained with a masked language modeling (MLM) objective, inspired by the Cloze task (Taylor, 1953). It comes in two sizes: the BERTBASE architecture and the BERTLARGE architecture. The BERTBASE architecture is 3 times smaller and therefore faster and easier to use, while BERTLARGE achieves increased performance on downstream tasks. RoBERTa improves on the original implementation of BERT by identifying key design choices for better performance: using dynamic masking, removing the next sentence prediction task, training with larger batches, on more data, and for longer.

1 Released at: https://camembert-model.fr
2 https://allennlp.org/elmo
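To make the MLM objective and RoBERTa's dynamic masking concrete, here is a minimal sketch of BERT-style token corruption (15% of positions are selected; of those, 80% become a mask token, 10% a random token, 10% are left unchanged). These rates are the standard BERT/RoBERTa choices; the function name and toy inputs are ours, not from the paper. Re-drawing the mask on every pass over the data, as below, is what distinguishes dynamic from static masking.

```python
import random

def dynamic_mask(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style masking, re-sampled on every call (hence 'dynamic').

    Each selected position is replaced by the mask token (80%), a random
    token (10%), or kept unchanged (10%); the original ids are returned
    as labels only at the selected positions.
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 is the usual "ignore" index for the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                        # 80%: replace with <mask>
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)   # 10%: random token
            # else 10%: keep the original token unchanged
    return inputs, labels

# Because the corruption is re-sampled at every epoch, the model sees a
# different masking of the same sentence each time (dynamic masking).
corrupted, targets = dynamic_mask([5, 42, 17, 99, 23], vocab_size=32000, mask_id=4)
```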
3 Downstream evaluation tasks

In this section, we present the four downstream tasks that we use to evaluate CamemBERT, namely: Part-Of-Speech (POS) tagging, dependency parsing, Named Entity Recognition (NER) and Natural Language Inference (NLI). We also present the baselines that we will use for comparison.

Tasks   POS tagging is a low-level syntactic task, which consists in assigning to each word its corresponding grammatical category. Dependency parsing consists in predicting the labeled syntactic tree in order to capture the syntactic relations between words.
For both of these tasks we run our experiments using the Universal Dependencies (UD)3 framework and its corresponding UD POS tag set (Petrov et al., 2012) and UD treebank collection (Nivre et al., 2018), which was used for the CoNLL 2018 shared task (Seker et al., 2018). We perform our evaluations on the four freely available French UD treebanks in UD v2.2: GSD (McDonald et al., 2013), Sequoia4 (Candito and Seddah, 2012; Candito et al., 2014), Spoken (Lacheret et al., 2014; Bawden et al., 2014)5, and ParTUT (Sanguinetti and Bosco, 2015). A brief overview of the size and content of each treebank can be found in Table 1.

NLI consists in predicting whether a hypothesis sentence is entailed by, neutral with respect to, or in contradiction with a premise sentence. The XNLI dataset (Conneau et al., 2018) is the extension of the Multi-Genre NLI (MultiNLI) corpus (Williams et al., 2018) to 15 languages, obtained by manually translating the validation and test sets into each of those languages. The English training set is machine translated for all languages other than English. The dataset is composed of 122k training, 2,490 development and 5,010 test examples for each language. As usual, NLI performance is evaluated using accuracy.

Baselines   In dependency parsing and POS tagging we compare our model with:

• mBERT: the multilingual cased version of BERT (see Section 2.1). We fine-tune mBERT on each of the treebanks with an additional layer for POS tagging and dependency parsing, in the same conditions as our CamemBERT model (a minimal sketch of this setup is given below).
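As an illustration of what "an additional layer" on top of a pretrained encoder means in practice, here is a minimal token-classification sketch using the Hugging Face transformers library; the checkpoint name, the dummy sentence and labels, and the absence of a treebank loader are simplifications for illustration, not the paper's exact training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Any BERT-like encoder can be plugged in; multilingual BERT is used here as an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

num_upos_tags = 17  # size of the UD universal POS tag set
classifier = torch.nn.Linear(encoder.config.hidden_size, num_upos_tags)  # the "additional layer"

batch = tokenizer(["Le chat dort ."], return_tensors="pt")
hidden = encoder(**batch).last_hidden_state      # shape: (1, seq_len, hidden_size)
logits = classifier(hidden)                      # one POS distribution per subword token

# During fine-tuning, a cross-entropy loss on gold tags is backpropagated
# through both the classifier and the encoder (dummy all-zero tags here).
gold = torch.zeros(logits.size(1), dtype=torch.long)
loss = torch.nn.functional.cross_entropy(logits.view(-1, num_upos_tags), gold)
```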
Table 2: POS and dependency parsing scores on 4 French treebanks, reported on test sets assuming gold tokenization and segmentation (best model selected on validation out of 4). Best scores in bold, second best underlined.

Table 5: Results on the four tasks using language models pre-trained on data sets of varying homogeneity and size, reported on validation sets (average of 4 runs for POS tagging, parsing and NER, average of 10 runs for NLI).
Corpus      Size    #tokens   #docs    Tokens/doc percentiles (5% / 50% / 95%)
Wikipedia   4GB     990M      1.4M     102 / 363 / 2530
CCNet       135GB   31.9B     33.1M    128 / 414 / 2869
OSCAR       138GB   32.7B     59.4M    28 / 201 / 1946

Table 6: Statistics on the pretraining datasets used.
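The per-document token percentiles in Table 6 are ordinary empirical percentiles; a minimal sketch of how such statistics can be computed (the whitespace tokenization and toy documents are placeholders, not the paper's actual preprocessing):

```python
import numpy as np

def tokens_per_doc_stats(documents):
    """Return the 5th/50th/95th percentiles of tokens per document."""
    counts = np.array([len(doc.split()) for doc in documents])  # naive whitespace tokens
    return np.percentile(counts, [5, 50, 95])

print(tokens_per_doc_stats(["un petit document", "un document un peu plus long que le premier"]))
```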
This also provides us a way to investigate the impact of pretraining data size. Downstream task performance for our alternative versions of CamemBERT is provided in Table 5. The upper section reports scores in the fine-tuning setting, while the lower section reports scores for the embeddings.

6.1 Common Crawl vs. Wikipedia?

Table 5 clearly shows that models trained on the 4GB versions of OSCAR and CCNet (Common Crawl) perform consistently better than the one trained on the French Wikipedia. This is true both in the fine-tuning and in the embeddings setting. Unsurprisingly, the gap is larger on tasks involving texts whose genre and style diverge more from those of Wikipedia, such as tagging and parsing on the Spoken treebank. The performance gap is also very large on the XNLI task, probably as a consequence of the larger diversity of Common-Crawl-based corpora in terms of genres and topics. XNLI is indeed based on MultiNLI, which covers a range of genres of spoken and written text.
The downstream task performances of the models trained on the 4GB versions of CCNet and OSCAR are much more similar.16

6.2 How much data do you need?

An unexpected outcome of our experiments is that the model trained "only" on the 4GB sample of OSCAR performs similarly to the standard CamemBERT trained on the whole 138GB OSCAR. The only task with a large performance gap is NER, where "138GB" models are better by 0.9 F1 points. This could be due to the higher number of named entities present in the larger corpora, which is beneficial for this task. On the contrary, other tasks don't seem to gain from the additional data.
In other words, when trained on corpora such as OSCAR and CCNet, which are heterogeneous in terms of genre and style, 4GB of uncompressed text is a large enough pretraining corpus to reach state-of-the-art results with the BASE architecture, better than those obtained with mBERT (pretrained on 60GB of text).17 This calls into question the need to use a very large corpus such as OSCAR or CCNet when training a monolingual Transformer-based language model such as BERT or RoBERTa. Not only does this mean that the computational (and therefore environmental) cost of training a state-of-the-art language model can be reduced, but it also means that CamemBERT-like models can be trained for all languages for which a Common-Crawl-based corpus of 4GB or more can be created. OSCAR is available in 166 languages, and provides such a corpus for 38 of them. Moreover, it is possible that slightly smaller corpora (e.g. down to 1GB) could also prove sufficient to train high-performing language models. We obtained our results with BASE architectures; further research is needed to confirm the validity of our findings on larger architectures and other, more complex natural language understanding tasks. However, even with a BASE architecture and 4GB of training data, the validation loss is still decreasing beyond 100k steps (and 400 epochs). This suggests that we are still under-fitting the 4GB pretraining dataset, and that training longer might increase downstream performance.

16 We provide the results of a model trained on the whole CCNet corpus in the Appendix. The conclusions are similar when comparing models trained on the full corpora: downstream results are similar when using OSCAR or CCNet.
17 The OSCAR-4GB model gets slightly better XNLI accuracy than the full OSCAR-138GB model (81.88 vs. 81.55). This might be due to the random seed used for pretraining, as each model is pretrained only once.
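For readers who want to reproduce a "4GB sample" experiment, here is a rough sketch of how a fixed-size slice of a web corpus could be drawn with the Hugging Face datasets library; the dataset and configuration names are assumptions for illustration, and the paper does not state that its sample was built this way.

```python
from datasets import load_dataset

TARGET_BYTES = 4 * 1024**3  # roughly 4GB of raw text

# Assumed dataset/config names; the French portion of OSCAR is used here as an example.
stream = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

collected, size = [], 0
for record in stream:
    text = record["text"]
    collected.append(text)
    size += len(text.encode("utf-8"))
    if size >= TARGET_BYTES:
        break  # stop once roughly 4GB of uncompressed text has been gathered
```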
7 Discussion

Since the pre-publication of this work (Martin et al., 2019), many monolingual language models have appeared, e.g. (Le et al., 2019; Virtanen et al., 2019; Delobelle et al., 2020), for as many as 30 languages (Nozza et al., 2020). In almost all tested configurations, they displayed better results than multilingual language models such as mBERT (Pires et al., 2019). Interestingly, Le et al. (2019) showed that using their FlauBERT, a RoBERTa-based language model for French trained on less but more edited data, in conjunction with CamemBERT in an ensemble system could improve the performance of a parsing model and establish a new state of the art in constituency parsing of French, highlighting the complementarity of both models.18 As was the case for English when BERT was first released, the availability of similar-scale language models for French has enabled interesting applications, such as large-scale anonymization of legal texts, where CamemBERT-based models established a new state of the art on this task (Benesty, 2019), or the first large question answering experiments on a French SQuAD dataset that was released very recently (d'Hoffschmidt et al., 2020), where the authors matched human performance using CamemBERTLARGE. Being the first pre-trained language model that used the open-source Common Crawl OSCAR corpus, and given its impact on the community, CamemBERT paved the way for the many works on monolingual language models that followed. Furthermore, the availability of all its training data favors reproducibility and is a step towards better understanding such models. In that spirit, we make the models used in our experiments available via our website and via the huggingface and fairseq APIs, in addition to the base CamemBERT model.

18 We refer the reader to (Le et al., 2019) for a comprehensive benchmark and details therein.

8 Conclusion

In this work, we investigated the feasibility of training a Transformer-based language model for languages other than English. Using French as an example, we trained CamemBERT, a language model based on RoBERTa. We evaluated CamemBERT on four downstream tasks (part-of-speech tagging, dependency parsing, named entity recognition and natural language inference), on which our best model reached or improved the state of the art in all tasks considered, even when compared to strong multilingual models such as mBERT, XLM and XLM-R, while also having fewer parameters.
Our experiments demonstrate that using web-crawled data with high variability is preferable to using Wikipedia-based data. In addition, we showed that our models can reach surprisingly high performance with as little as 4GB of pretraining data, thus questioning the need for large-scale pretraining corpora. This shows that state-of-the-art Transformer-based language models can be trained on languages with far fewer resources than English, whenever a few gigabytes of data are available. This paves the way for the rise of monolingual contextual pre-trained language models for under-resourced languages. Whether pretraining on small domain-specific content will be a better option than transfer learning techniques such as fine-tuning remains an open question, which we leave for future work.
Pretrained on pure open-source corpora, CamemBERT is freely available and distributed under the MIT license via popular NLP libraries (fairseq and huggingface) as well as on our website camembert-model.fr.
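Since the pretrained checkpoints are distributed through both libraries, here is a minimal usage sketch with the Hugging Face transformers API; "camembert-base" is the checkpoint name published on that hub, while the example sentence and the fill-mask use case are ours. An equivalent checkpoint is also available through fairseq.

```python
import torch
from transformers import CamembertTokenizer, CamembertForMaskedLM

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

# Fill-in-the-blank with the MLM head: predict the masked word.
inputs = tokenizer("Le camembert est <mask> !", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_token = logits[0, mask_index].argmax(-1)
print(tokenizer.decode(top_token))  # prints the model's best guess for the masked word
```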
Acknowledgments

We want to thank Clémentine Fourrier for her proofreading and insightful comments, and Alix Chagué for her great logo. This work was partly funded by three French National funded projects granted to Inria and other partners by the Agence Nationale de la Recherche, namely projects PARSITI (ANR-16-CE33-0021), SoSweet (ANR-15-CE38-0011) and BASNUM (ANR-18-CE38-0003), as well as by the last author's chair in the PRAIRIE institute funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference ANR-19-P3IA-0001.
References

Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a Treebank for French, pages 165–187. Kluwer, Dordrecht.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1638–1649. Association for Computational Linguistics.

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817–1853.

Rachel Bawden, Marie-Amélie Botalla, Kim Gerdes, and Sylvain Kahane. 2014. Correcting and validating syntactic dependency in the spoken French treebank Rhapsodie. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2320–2325, Reykjavik, Iceland. European Language Resources Association (ELRA).

Michaël Benesty. 2019. NER algo benchmark: spaCy, flair, m-BERT and CamemBERT on anonymizing French commercial legal cases.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Marie Candito and Benoit Crabbé. 2009. Improving generative statistical parsing with semi-supervised word clustering. In Proc. of IWPT'09, Paris, France.

Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah, and Éric Villemonte de la Clergerie. 2014. Deep syntax annotation of the Sequoia French treebank. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, pages 2298–2305. European Language Resources Association (ELRA).

Marie Candito and Djamé Seddah. 2012. Le corpus Sequoia : annotation syntaxique et exploitation pour l'adaptation d'analyseur par pont lexical (The Sequoia corpus: syntactic annotation and use for a parser lexical domain adaptation method) [in French]. In Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN, Grenoble, France, June 4-8, 2012, pages 321–334.

Branden Chan, Timo Möller, Malte Pietsch, Tanay Soni, and Chin Man Yeung. 2019. German BERT. https://deepset.ai/german-bert.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. ArXiv preprint 1911.02116.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2475–2485. Association for Computational Linguistics.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3079–3087.

Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based language model. ArXiv preprint 2001.06286.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Multilingual BERT. https://github.com/google-research/bert/blob/master/multilingual.md.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Martin d'Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendlé. 2020. FQuAD: French question answering dataset. ArXiv preprint 2002.06071.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Yoann Dupont. 2017. Exploration de traits pour la reconnaissance d'entités nommées du français par apprentissage automatique. In 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), page 42.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 427–431. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In (Korhonen et al., 2019), pages 3651–3657.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. ArXiv preprint 1612.03651.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ArXiv preprint 1412.6980.

Daniel Kondratyuk. 2019. 75 languages, 1 model: Parsing universal dependencies universally. CoRR, abs/1904.02099.

Anna Korhonen, David R. Traum, and Lluís Màrquez, editors. 2019. Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 66–75. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics.

Anne Lacheret, Sylvain Kahane, Julie Beliao, Anne Dister, Kim Gerdes, Jean-Philippe Goldman, Nicolas Obin, Paola Pietrandrea, and Atanas Tchobanov. 2014. Rhapsodie: a prosodic-syntactic treebank for spoken French. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 295–301, Reykjavik, Iceland. European Language Resources Association (ELRA).

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289. Morgan Kaufmann.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 260–270. The Association for Computational Linguistics.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv preprint 1909.11942.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2019. FlauBERT: Unsupervised language model pre-training for French. ArXiv preprint 1912.05372.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint 1907.11692.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a Tasty French Language Model. ArXiv preprint 1911.03894.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97, Sofia, Bulgaria. Association for Computational Linguistics.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Debora Nozza, Federico Bianchi, and Dirk Hovy. 2020. What the [MASK]? Making sense of language-specific BERT models.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In Challenges in the Management of Large Corpora (CMLC-7) 2019, page 9.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 48–53. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1532–1543. ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pages 2089–2096. European Language Resources Association (ELRA).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? ArXiv preprint 1906.01502.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Preprint, https://paperswithcode.com/paper/language-models-are-unsupervised-multitask.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv preprint 1910.10683.

Benoît Sagot, Marion Richard, and Rosa Stern. 2012. Annotation référentielle du corpus arboré de Paris 7 en entités nommées (Referential named entity annotation of the Paris 7 French treebank) [in French]. In Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN, Grenoble, France, June 4-8, 2012, pages 535–542. ATALA/AFCP.

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University Parallel Treebank. In Roberto Basili, Cristina Bosco, Rodolfo Delmonte, Alessandro Moschitti, and Maria Simi, editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, volume 589 of Studies in Computational Intelligence, pages 51–69. Springer.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Amit Seker, Amir More, and Reut Tsarfaty. 2018. Universal morpho-syntactic parsing and the contribution of lexica: Analyzing the ONLP lab submission to the CoNLL 2018 shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 208–215.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Milan Straka, Jana Straková, and Jan Hajič. 2019. Evaluating contextualized embeddings on 54 languages in POS tagging, lemmatization and dependency parsing. ArXiv preprint 1908.07448.

Jana Straková, Milan Straka, and Jan Hajič. 2019. Neural architectures for nested NER through linearization. In (Korhonen et al., 2019), pages 5326–5331.

Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
In the appendix, we analyse different design choices of CamemBERT (Table 8), namely with respect to the use of whole-word masking, the training dataset, the model size, and the number of training steps, complementing the analyses of the impact of corpus origin and size (Section 6). In all the ablations, all scores come from at least 4 averaged runs. For POS tagging and dependency parsing, we average the scores on the 4 treebanks. We also report all averaged test scores of our different models in Table 7.

Figure 1: Impact of number of pretraining steps on downstream performance for CamemBERT. (The plot shows Parsing, NER and NLI scores on the left axis and language modelling perplexity on the right axis, as a function of pretraining steps from 0 to 100k.)
A Impact of Whole-Word Masking

In Table 8, we compare models trained using traditional subword masking with models trained using whole-word masking. Whole-word masking positively impacts downstream performance for NLI (although only by 0.5 points of accuracy). To our surprise, this whole-word masking scheme does not benefit lower-level tasks such as named entity recognition, POS tagging and dependency parsing much.
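For concreteness, the two schemes differ only in how positions are selected: subword masking samples individual subword tokens independently, while whole-word masking samples words and then masks all of their subword pieces together. A minimal sketch follows (the word-grouping convention with "▁" follows SentencePiece-style tokenization; the function name and toy example are ours, not the paper's implementation):

```python
import random

def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>"):
    """Group SentencePiece-style pieces into words ('▁' starts a new word)
    and mask every piece of each selected word together."""
    words, current = [], []
    for i, piece in enumerate(pieces):
        if piece.startswith("▁") and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)

    masked = list(pieces)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token  # all pieces of the word are masked at once
    return masked

# With subword masking, "▁anti", "constitution" and "nellement" could each be
# masked independently; with whole-word masking they are masked together.
print(whole_word_mask(["▁anti", "constitution", "nellement", "▁!"], mask_prob=1.0))
```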
B Impact of model size

Table 8 compares models trained with the BASE and LARGE architectures. These models were trained on the CCNet corpus (135GB) for practical reasons. We confirm the positive influence of larger models on the NLI and NER tasks: the LARGE architecture leads to error reductions of 19.7% and 23.7%, respectively. To our surprise, on POS tagging and dependency parsing, having three times more parameters does not lead to a significant difference compared to the BASE model. Tenney et al. (2019) and Jawahar et al. (2019) have shown that low-level syntactic capabilities are learnt in lower layers of BERT, while higher-level semantic representations are found in its upper layers. POS tagging and dependency parsing probably do not benefit from adding more layers, as the lower layers of the BASE architecture already capture what is necessary to complete these tasks.

C Impact of training dataset

Table 8 compares models trained on CCNet and on OSCAR. The major difference between the two datasets is the additional filtering step of CCNet, which favors Wikipedia-like texts. The model pretrained on OSCAR gets slightly better results on POS tagging and dependency parsing, but a larger improvement (+1.31) on NER. The CCNet model gets better performance on NLI (+0.67).

D Impact of number of steps

Figure 1 displays the evolution of downstream task performance with respect to the number of pretraining steps. All scores in this section are averages over at least 4 runs with different random seeds. For POS tagging and dependency parsing, we also average the scores over the 4 treebanks.
We evaluate our model at every epoch (1 epoch equals 8360 steps). We report the masked language modelling perplexity along with downstream performances. Figure 1 suggests that the more complex the task, the more impactful the number of steps is. We observe an early plateau for dependency parsing and NER at around 22k steps, while for NLI, even if the marginal improvement with regard to pretraining steps becomes smaller, the performance is still slowly increasing at 100k steps.
In Table 8, we compare two models trained on CCNet, one for 100k steps and the other for 500k steps, to evaluate the influence of the total number of steps. The model trained for 500k steps does not improve much over the one trained for 100k steps on POS tagging and parsing; the increase is slightly higher for XNLI (+0.84).
Those results suggest that low-level syntactic representations are captured early in the language model training process, while more steps are needed to extract the complex semantic information required for NLI.
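For reference, the perplexity curve in Figure 1 corresponds to the usual exponentiated masked-LM cross-entropy; the paper does not spell out its exact formula, so the standard definition is given here, with M the set of masked positions in the validation data and x̃ the corrupted input sequence:

\[
\mathrm{PPL}_{\mathrm{MLM}} \;=\; \exp\!\Bigl(-\frac{1}{|M|}\sum_{i \in M} \log p_\theta\bigl(x_i \mid \tilde{x}\bigr)\Bigr)
\]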
DATASET MASKING ARCH. #STEPS GSD-UPOS GSD-LAS SEQUOIA-UPOS SEQUOIA-LAS SPOKEN-UPOS SPOKEN-LAS PARTUT-UPOS PARTUT-LAS NER-F1 NLI-ACC
Fine-tuning
OSCAR Subword BASE 100k 98.25 92.29 99.25 93.70 96.95 79.96 97.73 92.68 89.23 81.18
OSCAR Whole-word BASE 100k 98.21 92.30 99.21 94.33 96.97 80.16 97.78 92.65 89.11 81.92
CCNET Subword BASE 100k 98.02 92.06 99.26 94.13 96.94 80.39 97.55 92.66 89.05 81.77
CCNET Whole-word BASE 100k 98.03 92.43 99.18 94.26 96.98 80.89 97.46 92.33 89.27 81.92
CCNET Whole-word BASE 500k 98.21 92.43 99.24 94.60 96.69 80.97 97.65 92.48 89.08 83.43
CCNET Whole-word L ARGE 100k 98.01 91.09 99.23 93.65 97.01 80.89 97.41 92.59 89.39 85.29
Embeddings (with UDPipe Future (tagging, parsing) or LSTM+CRF (NER))
OSCAR Subword BASE 100k 98.01 90.64 99.27 94.26 97.15 82.56 97.70 92.70 90.25 -
OSCAR Whole-word BASE 100k 97.97 90.44 99.23 93.93 97.08 81.74 97.50 92.28 89.48 -
CCNET Subword BASE 100k 97.87 90.78 99.20 94.33 97.17 82.39 97.54 92.51 89.38 -
CCNET Whole-word BASE 100k 97.96 90.76 99.23 94.34 97.04 82.09 97.39 92.82 89.85 -
CCNET Whole-word BASE 500k 97.84 90.25 99.14 93.96 97.01 82.17 97.27 92.28 89.07 -
CCNET Whole-word L ARGE 100k 98.01 90.70 99.23 94.01 97.04 82.18 97.31 92.28 88.76 -
Table 7: Performance on the test sets for all trained models (average over multiple fine-tuning seeds).

Table 8: Comparison of scores on the validation sets for different design choices. POS tagging and parsing scores are averaged over the 4 treebanks (average over multiple fine-tuning seeds).