Tokenization and normalization

Before we can start modeling or using any advanced neural network, we need to go through two important steps — tokenization and normalization.

A tokenizer is a tool that performs segmentation: it cuts text into pieces called tokens. Each token corresponds to a linguistically meaningful, easily manipulated unit. Tokens are language dependent, and tokenization is part of the process of normalizing the input text so that it can be manipulated more easily and its meaning extracted later in the training process.
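To make this concrete, here is a minimal rule-based tokenizer sketch in plain Python. The regex and function name are illustrative assumptions, not a production recipe: it simply treats runs of word characters as tokens and keeps each punctuation mark as its own token.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # Illustrative rule: runs of word characters become one token,
    # and each punctuation mark becomes a separate token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Let's segment this sentence, shall we?"))
# ['Let', "'", 's', 'segment', 'this', 'sentence', ',', 'shall', 'we', '?']
```

Even this toy version shows why tokens are language dependent: the same rule would mishandle languages that do not separate words with spaces, such as Chinese or Japanese.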

When you work with a dataset, you can never be 100% sure that the text is clean and normalized. Using a good tokenizer helps ensure that the text fed to the network is clean and consistent.
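Normalization itself is usually a small, composable pipeline. The following sketch shows one common combination, Unicode NFKC normalization, lowercasing, and whitespace collapsing; these exact choices are assumptions that depend on your task and language.

```python
import unicodedata

def normalize(text: str) -> str:
    # NFKC folds compatibility characters (full-width letters,
    # non-breaking spaces, ...) into their canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(text.lower().split())

print(normalize("Ｈｅｌｌｏ\u00a0  World！"))  # 'hello world!'
```

Note that aggressive normalization (lowercasing, stripping accents) destroys information, so it should be chosen deliberately rather than applied by default.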

In some cases, a handful of handwritten rules is not enough to capture meaningful units (rare or domain-specific vocabulary, for example), so a learning approach can be used instead. Training on a corpus makes it possible to learn token boundaries that generalize to all incoming texts.
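One widely used learning approach is byte-pair encoding (BPE), which repeatedly merges the most frequent pair of adjacent symbols in a corpus until a vocabulary budget is reached. The sketch below condenses the classic algorithm; the toy corpus and the helper names are illustrative.

```python
import re
from collections import Counter

def pair_stats(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every occurrence of the pair as a single merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):
    stats = pair_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

In practice you would not implement this yourself: libraries such as Hugging Face's tokenizers (discussed below) ship fast, tested implementations of BPE and related algorithms like WordPiece and Unigram.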

Thus, using tokenizers pre-trained on large datasets of compound and rare words makes it possible to avoid incorrectly splitting expressions such as “bow tie” or “father-in-law”.
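Whether a given pretrained tokenizer keeps such compounds intact depends on what it was trained on, so it is worth checking. The sketch below uses Hugging Face's tokenizers library (introduced in the next paragraph) to inspect the behavior and, if needed, register a compound as an atomic token; the model name and the outputs shown in comments are assumptions.

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hugging Face Hub
# (requires the `tokenizers` package and network access).
tok = Tokenizer.from_pretrained("bert-base-uncased")

print(tok.encode("father-in-law", add_special_tokens=False).tokens)
# e.g. ['father', '-', 'in', '-', 'law']

# Register the compound as a single, unsplittable token.
tok.add_tokens(["father-in-law"])
print(tok.encode("father-in-law", add_special_tokens=False).tokens)
# e.g. ['father-in-law']
```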

When building neural networks, you have to choose what kind of data the network will be trained on. Most of the time, existing tokenizers will do the job, but in some cases you want the freedom to create your own tokenizer from your own dataset, or to apply your own technique for splitting words. That’s where Hugging Face’s tokenizers library comes in handy.
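As a sketch of what that looks like, the snippet below trains a small BPE tokenizer from scratch on your own files with the tokenizers library; the file path, vocabulary size, and special tokens are placeholders to adapt to your dataset.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model with a simple whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn the merge rules directly from your own corpus.
trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["my_corpus.txt"], trainer)  # placeholder path

tokenizer.save("my_tokenizer.json")
print(tokenizer.encode("Hello tokenization!").tokens)
```

Because the trained tokenizer is saved as a single JSON file, it can be reloaded later with Tokenizer.from_file("my_tokenizer.json") and shared alongside the model.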