The input text gets split into meaningful word pieces before it is fed to the BERT model:

He's your dentist? --> He ' s your den ##tist ?

The special characters ## mark the continuation of a word. For the English examples I had seen so far, the splits always looked perfect. Accomplishing the same for 104 languages with different alphabets sounded crazy to me. Driven by curiosity, I installed Hugging Face’s PyTorch implementation of BERT to try it out.
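
A minimal way to reproduce the experiment, assuming the current transformers package (the original pytorch-pretrained-bert exposes the same BertTokenizer interface) and the bert-base-multilingual-cased checkpoint; the German sentence is just an illustrative input:

from transformers import BertTokenizer

# Load the shared multilingual WordPiece vocabulary used for all 104 languages.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Tokenize an illustrative German sentence.
print(tokenizer.tokenize("Hallo, wie geht es dir?"))
# 'Hallo' comes back as the two pieces 'Hall' and '##o' (see below).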

To be honest, this was not what I expected. Neither -o nor -ätzen is a common German suffix; the words seemed to be split in somewhat arbitrary ways. Checking the tokenizer code revealed why:

Basic Tokenization:

The sentence gets split on whitespace and punctuation, and every Chinese character is treated as a word of its own.
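
This first stage can be inspected directly, assuming the tokenizer object from above (the plain BertTokenizer keeps this stage in its basic_tokenizer attribute):

# Whole words only: punctuation is split off, nothing is broken into pieces yet.
print(tokenizer.basic_tokenizer.tokenize("He's your dentist?"))
# expected: ['He', "'", 's', 'your', 'dentist', '?']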

Word Piece Tokenization:

Each word is split into word pieces that are part of BERT’s vocabulary. If the word as a whole is not in the vocabulary, the tokenizer searches for the longest prefix that is in the vocabulary; the remainder of the word is then handled the same way.
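
In Python, this greedy longest-match-first lookup looks roughly as follows (a simplified sketch, not the library’s actual code; vocab is assumed to be a set of word pieces, where continuation pieces carry the ## prefix):

def wordpiece(word, vocab):
    # Greedy longest-match-first: take the longest prefix found in the
    # vocabulary, then repeat on the rest of the word. Pieces that do not
    # start the word are looked up with the '##' continuation prefix.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]   # no prefix of the remainder is in the vocabulary
        pieces.append(piece)
        start = end
    return pieces

# e.g. wordpiece("Hallo", {"Hall", "##o"}) returns ['Hall', '##o']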

So apparently, the German word Hallo is not part of the vocabulary, while the prefix Hall is, even though Hall is not a German word but (most likely) an English one! There is no language detection, so the word piece tokenizer can end up mixing languages.
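
This is easy to verify against the vocabulary itself, again assuming the tokenizer object from above (get_vocab() returns the mapping from word pieces to ids):

vocab = tokenizer.get_vocab()
print("Hallo" in vocab, "Hall" in vocab, "##o" in vocab)
# expected: False True True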