Step 1: setting up training environment

First and foremost, we get the packages required to train the model.

The Jupyter environment allows executing bash commands directly from the notebook by using an exclamation mark ‘!’, like this:



!git clone !pip install sentencepiece!git clone https://github.com/google-research/bert

I will be exploiting this approach to make use of several other bash commands throughout the experiment.

Now, let’s import the packages and authorize ourselves in Google Cloud.

Setting up BERT training environment

Step 2: getting the data

We proceed with obtaining a corpus of text data. For this experiment, we will be using the OpenSubtitles dataset, which is available for 65 languages here.

Unlike more commonly used text datasets (like Wikipedia) it does not require any complex pre-processing. It also comes pre-formatted with one sentence per line, which is a requirement for further processing steps.

Feel free to use the dataset for your language instead by setting the corresponding language code.

Download OPUS data

For demonstration purposes, we will only use a small fraction of the whole corpus by default.

When training the real model, make sure to uncheck the DEMO_MODE checkbox to use a 100x larger dataset.

Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base model.

Truncate dataset

Step 3: preprocessing text

The raw text data we have downloaded contains punctuation, uppercase letters and non-UTF symbols which we will remove before proceeding. During inference, we will apply the same procedure to new data.

If your use-case requires different preprocessing (e.g. if uppercase letters or punctuation are expected during inference), feel free to modify the function below to accommodate for your needs.

Define preprocessing routine

Now let’s preprocess the whole dataset.

Apply preprocessing

Step 4: building the vocabulary

For the next step, we will learn a new vocabulary that we will use to represent our dataset.

The BERT paper uses a WordPiece tokenizer, which is not available in opensource. Instead, we will be using SentencePiece tokenizer in unigram mode. While it is not directly compatible with BERT, with a small hack we can make it work.

SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for building the vocabulary. Another option would be to use a machine with more RAM for this step — that decision is up to you.

Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default. We disable them explicitly by setting their indices to -1.

The typical values for VOC_SIZE are somewhere in between 32000 and 128000. We reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and fine-tune the model after the pre-training phase is finished.

Learn SentencePiece vocabulary

Now, let’s see how we can make SentencePiece work for the BERT model.

Below is a sentence tokenized using the WordPiece vocabulary from a

pre-trained English BERT-base model from the official repo.

>>> wordpiece.tokenize("Colorless geothermal substations are generating furiously")



['color',

'##less',

'geo',

'##thermal',

'sub',

'##station',

'##s',

'are',

'generating',

'furiously']

As we can see, the WordPiece tokenizer prepends the subwords which occur in the middle of words with ‘##’. The subwords occurring at the beginning of words are unchanged. If the subword occurs both in the beginning and in the middle of words, both versions (with and without ‘##’) are added to the vocabulary.

SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let’s have a look at the learned vocabulary:

Read the learned SentencePiece vocabulary

This gives:

Learnt vocab size: 31743

Sample tokens: ['▁cafe', '▁slippery', 'xious', '▁resonate', '▁terrier', '▁feat', '▁frequencies', 'ainty', '▁punning', 'modern']

As we may observe, SentencePiece does quite the opposite to WordPiece. From the documentation:

SentencePiece first escapes the whitespace with a meta-symbol “▁” (U+2581) as follows:

Hello▁World .

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Subwords which occur after whitespace (which are also those that most words begin with) are prepended with ‘▁’, while others are unchanged. This excludes subwords which only occur at the beginning of sentences and nowhere else. These cases should be quite rare, however.

So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion, removing “▁” from the tokens that contain it and adding “##” to the ones that don’t.

We also add some special control symbols which are required by the BERT architecture. By convention, we put those at the beginning of the vocabulary.

Additionally, we append some placeholder tokens to the vocabulary.

Those are useful if one wishes to update the pre-trained model with new,

task-specific tokens. In that case, the placeholder tokens are replaced with new real ones, the pre-training data is re-generated, and the model is fine-tuned on new data.

Convert the vocabulary to use for BERT

Finally, we write the obtained vocabulary to file.

Dump vocabulary to file

Now let’s see how the new vocabulary works in practice:

>>> testcase = "Colorless geothermal substations are generating furiously"

>>> bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)

>>> bert_tokenizer.tokenize(testcase) ['color',

'##less',

'geo',

'##ther',

'##mal',

'sub',

'##station',

'##s',

'are',

'generat',

'##ing',

'furious',

'##ly']

Looking good!

Step 5: generating pre-training data

With the vocabulary at hand, we are ready to generate pre-training data for the BERT model.

Since our dataset might be quite large, we will split it into shards:

Split the dataset

Now, for each shard we need to call create_pretraining_data.py script from the BERT repo. To that end, we will employ the xargs command.

Before we start generating, we need to set some parameters to pass to the script. You can find out more about the meaning of them in the README.

Define parameters for pre-training data

Running this might take quite some time depending on the size of your dataset.

Create pre-training data

Step 6: setting up persistent storage

To preserve our hard-earned assets, we will persist them to Google Cloud Storage. Provided that you have created the GCS bucket, this should be easy.

We will create two directories in GCS, one for the data and one for the model. In the model directory, we will put the model vocabulary and configuration file.

Configure your BUCKET_NAME variable here before proceeding, otherwise you will not be able to train the model.

Configure GCS bucket name

Below is the sample hyperparameter configuration for BERT-base. Change at your own risk!

Configure BERT hyperparameters and save to disk

Now, we are ready to push our assets to GCS

Upload assets to GCS

Step 7: training the model

We are almost ready to begin training our model.

Be advised that some of the parameters from previous steps are duplicated here, so as to allow for convenient restart of the training procedure.

Make sure that the parameters set are exactly the same across the experiment.

Configure training run

Prepare the training run configuration, build the estimator and

input function, power up the bass cannon.

Build estimator model and input function

Execute!

Execute BERT training procedure

Training the model with the default parameters for 1 million steps

will take ~54 hours of run time. In case the kernel restarts for some reason, you may always continue training from the latest checkpoint.

This concludes the guide to pre-training BERT from scratch on a cloud TPU.

Next steps

Okay, we’ve trained the model, now what?

That is a topic for a whole new discussion. There are a couple of things you could do:

The really fun stuff is still to come, so stay woke. Meanwhile, check out the awesome bert-as-a-service project and start serving your newly trained model in production.

Keep learning!

Other guides in this series