Usage

Import the required packages. Please note that I have not included the usual suspects such as os, pandas, etc.
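A minimal sketch of the imports, assuming the package layout of fast-bert and the Hugging Face pytorch-pretrained-bert library at the time of writing:

```python
import logging

import torch

# Hugging Face BERT tokenizer (pytorch-pretrained-bert package)
from pytorch_pretrained_bert.tokenization import BertTokenizer

# fast-bert building blocks used throughout this walkthrough
from fast_bert.data import BertDataBunch
from fast_bert.learner import BertLearner
from fast_bert.metrics import accuracy
```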

Define the general parameters and the path locations for data, labels and pretrained models up front (a good engineering practice).
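Something along these lines works; the names (DATA_PATH, LABEL_PATH, etc.) and the values are illustrative, not prescribed by the library:

```python
# Illustrative paths -- point these at your own project layout.
DATA_PATH = 'data/'        # folder containing train.csv, val.csv, test.csv
LABEL_PATH = 'labels/'     # folder containing label.csv
BERT_PRETRAINED_PATH = 'models/bert-base-uncased/'
OUTPUT_PATH = 'output/'    # where the trained model will be saved

# Illustrative hyperparameters gathered in one place.
args = {
    'max_seq_length': 512,
    'train_batch_size': 32,
    'learning_rate': 6e-5,
    'num_train_epochs': 4,
    'fp16': True,          # only if Nvidia Apex is installed
    'loss_scale': 128,
}
```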

Tokenizer

Create a tokenizer object. This is the WordPiece tokenizer, a BPE-style subword tokenizer, and it is available from the magnificent Hugging Face BERT PyTorch library.

The do_lower_case parameter depends on the version of the BERT pretrained model you are using: set it to true for uncased models and false for cased ones. This example uses the BERT base uncased model, so do_lower_case is set to true.
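For example, using the standard Hugging Face tokenizer API:

```python
# do_lower_case=True because this example uses the uncased BERT base model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
```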

GPU & Device

Training a BERT model requires at least one GPU, and preferably several. In this step we set up the device parameters for training.

Note that in the future releases, this step will be abstracted from the user and the library will automatically determine the correct device profile.
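One straightforward way to set this up with plain PyTorch (the logger is created here as well, since the learner will need one later):

```python
# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Flag whether more than one GPU is available for data-parallel training.
multi_gpu = torch.cuda.device_count() > 1

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
```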

BertDataBunch

This is an excellent idea borrowed from the fast.ai library. The databunch object takes training, validation and test csv files and converts the data into BERT's internal representation. It also instantiates the correct data loaders based on the device profile, batch_size and max_sequence_length.

The DataBunch object is given the location of the data files and of the label.csv file. For each of the data files, i.e. train.csv, val.csv and/or test.csv, the databunch creates a dataloader object by converting the csv data into BERT-specific input objects. I would encourage you to explore the structure of the databunch object in a Jupyter notebook.
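A sketch of creating the databunch; the parameter names follow the early fast-bert examples and may differ in the version you have installed:

```python
databunch = BertDataBunch(
    DATA_PATH, LABEL_PATH, tokenizer,
    train_file='train.csv',
    val_file='val.csv',
    test_data='test.csv',
    text_col='text',       # name of the text column in your csv files
    label_col='label',     # name of the label column
    bs=args['train_batch_size'],
    maxlen=args['max_seq_length'],
    multi_gpu=multi_gpu,
    multi_label=False)
```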

BertLearner

Another concept in line with the fast.ai library, BertLearner is the ‘learner’ object that holds everything together. It encapsulates the key logic for the lifecycle of the model such as training, validation and inference.

The learner object takes the databunch created earlier as input, along with some other parameters such as the location of one of the pretrained BERT models and the FP16 training, multi_gpu and multi_label options.

The learner class contains the logic for the training loop, the validation loop, optimiser strategies and key metrics calculation. This helps developers focus on their custom use cases without worrying about these repetitive activities.

At the same time, the learner object is flexible enough to be customised, either through its parameters or by subclassing BertLearner and redefining the relevant methods.

The learner object does the following upon initialisation:

Creates a PyTorch BERT model and initialises it with the provided pre-trained weights. Based on the multi_label parameter, the model class will be BertForSequenceClassification or BertForMultiLabelSequenceClassification.

Assigns the model to the right device, i.e. a CUDA-based GPU or the CPU.

If Nvidia Apex is available, the distributed processing functions of Apex will be utilised.

fast-bert provides a number of metrics. For multi-class classification you will generally use accuracy, whereas for multi-label classification you should consider using accuracy_thresh and/or roc_auc.
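Putting the pieces together, a sketch of creating the learner; the from_pretrained_model signature shown here follows the early fast-bert examples, so treat the exact argument list as an assumption to check against your version:

```python
# Metrics are passed as a list of {'name', 'function'} dicts.
metrics = [{'name': 'accuracy', 'function': accuracy}]

learner = BertLearner.from_pretrained_model(
    databunch, BERT_PRETRAINED_PATH, metrics, device, logger,
    finetuned_wgts_path=None,   # no previously fine-tuned weights
    is_fp16=args['fp16'],
    loss_scale=args['loss_scale'],
    multi_gpu=multi_gpu,
    multi_label=False)
```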

Train the model

Start the model training by calling the fit method on the learner object. The method takes the number of epochs, the learning rate and the optimiser schedule_type as input. The following schedule types are supported (again courtesy of the Hugging Face BERT library); a usage sketch follows the list:

none : always returns learning rate 1.

warmup_constant : Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. Keeps learning rate equal to 1. after warmup.

warmup_linear : Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. Linearly decreases learning rate from 1. to 0. over remaining 1 - warmup steps.

warmup_cosine : Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. Decreases learning rate from 1. to 0. over remaining 1 - warmup steps following a cosine curve. If cycles (default=0.5) is different from default, learning rate follows cosine function after warmup.

warmup_cosine_hard_restarts : Linearly increases learning rate from 0 to 1 over warmup fraction of training steps. If cycles (default=1.) is different from the default, learning rate follows cycles times a cosine decaying learning rate (with hard restarts).

warmup_cosine_warmup_restarts : All training progress is divided into cycles (default=1.) parts of equal length. Every part follows a schedule with the first warmup fraction of the training steps linearly increasing from 0. to 1., followed by a learning rate decreasing from 1. to 0. following a cosine curve. Note that the total number of warmup steps over all cycles together is equal to warmup * cycles.
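A minimal fit call, assuming the argument order used in the fast-bert examples:

```python
# Train for the configured number of epochs with linear warmup scheduling.
learner.fit(args['num_train_epochs'],
            args['learning_rate'],
            schedule_type='warmup_linear')
```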

On calling the fit method, the library starts printing progress information to the logger object: the training and validation losses, and the metrics that you have requested.

In order to repeat the experiment with different parameters, just create a new learner object and call the fit method on it. If you have tons of GPU compute, you can possibly run multiple experiments in parallel by instantiating multiple databunch and learner objects at the same time.

Once you are happy with your experiments, call the save_and_reload method on the learner object to persist the model to disk.
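For example (the output directory and model name here are illustrative, and the exact signature may vary between versions):

```python
# Save the fine-tuned model to disk and reload it into the learner.
learner.save_and_reload(OUTPUT_PATH, 'my_bert_model')
```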