French has a long history as a primary or second language in the United Nations, European Union, Olympic Games, and countless other international arenas and organizations. Yet the proportion of natural language processing (NLP) research in the machine learning community that has focused on Voltaire’s tongue remains relatively small. Now, a team from Facebook AI Research, Inria, and Sorbonne Université have released CamemBERT, essentially a French version of Google AI’s game-changing pretrained language model BERT (Bidirectional Encoder Representations from Transformers).

Pretrained language models with transformer-based architectures such as BERT, GPT-2, RoBERTa, ALBERT, and T5 have enabled rapid advancements in the science of NLP and natural language understanding (NLU), including changing how machines fundamentally approach common queries. Most of the best models to come out of this hot research area however have been trained on and directed at the English language. RoBERTa for example was trained on more than 100GB of English-language data.

Because the performance of pretrained language models can be significantly improved by using more training data, researchers harvested a bunch of French text from the newly available large multilingual corpus OSCAR. The unshuffled version of the French OSCAR corpus comprises 138GB of uncompressed text and 32.7B SentencePiece tokens.

The researchers evaluated CamemBERT on four downstream tasks:

Part-of-speech (POS) tagging, a low-level syntactic task which involves assigning each word to its corresponding grammatical category

Dependency parsing, which involves predicting the labeled syntactic tree capturing the syntactic relations between words

Named Entity Recognition (NER), a sequence labeling task for predicting which words refer to real-world objects, such as people, locations, artifacts and organisations

Natural language inference (NLI), which involves predicting whether a hypothesis sentence is entailed, neutral or contradicts a premise sentence.

CamemBERT outperformed other French-language models in tests. Researchers believe the new pretrained model can be effectively fine-tuned for various downstream tasks, and that it opens a promising path for future research in French NLP.

The machine learning research community responded quickly to the CamemBERT release. Clement Delangue, CEO of NLP startup Hugging Face, tweeted “CamemBERT will revolutionize how to do NLP in French, the same way BERT did for English. Looking forward to seeing this happening for every single language to truly democratize NLP.”

The paper CamemBERT, a Tasty French Language Model is on arXiv.