This is a follow-up to our previous post about State of the Art Text Classification. We explain how to do hyperparameter optimisation with the Flair Python NLP library to achieve optimal results in text classification, outperforming Google’s AutoML Natural Language.

What is hyperparameter optimisation and why can’t we simply do it by hand?

Hyperparameter optimisation (or tuning) is the process of choosing a set of optimal parameters for a machine learning algorithm. Data preprocessors, optimisers and ML algorithms all receive a set of parameters that guide their behaviour. To achieve optimal performance, they need to be tuned to fit the statistical properties, feature types and size of the dataset used. The most typical hyperparameters in deep learning include the learning rate, the number of hidden layers in a deep neural network, the batch size and the dropout rate.

In NLP we also encounter a number of other hyperparameters, often related to preprocessing and text embedding, such as the type of embedding, the embedding dimension and the number of RNN layers.

Typically, if we’re lucky enough to have a problem simple enough to require only one or two hyperparameters with a few discrete values (like k in k-means, for example), we can simply try all possible options. But as the number of parameters grows, this trial-and-error approach quickly becomes impractical.

Our search space grows exponentially with the number of parameters tuned.

Assuming discrete options, this means that if we have 8 parameters with 10 discrete options each, we end up with 10⁸ possible combinations of hyperparameters. This makes hand-picking parameters infeasible, given that training a model usually requires a considerable amount of time and resources.
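To get a feel for the numbers, here is that calculation spelled out (the parameter counts are the ones from the example above, not from any particular model):

n_params = 8    # hyperparameters to tune
n_options = 10  # discrete options per hyperparameter

# An exhaustive grid search would need one training run per combination
grid_size = n_options ** n_params
print(grid_size)  # 100000000

At even one minute per training run, exhausting that grid would take roughly 190 years.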

There are many hyperparameter optimisation techniques, such as grid search, random search, Bayesian optimisation and gradient-based methods. Tree-structured Parzen Estimator (TPE) is the method used by Flair’s wrapper around Hyperopt, a popular Python hyperparameter optimisation library.
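To see what TPE does independently of Flair, here is a minimal Hyperopt sketch that minimises a toy objective (the objective function and bounds are made up purely for illustration):

from hyperopt import fmin, tpe, hp

# Toy objective: TPE searches for the x that minimises (x - 3)^2
best = fmin(
    fn=lambda x: (x - 3) ** 2,
    space=hp.uniform('x', -10, 10),
    algo=tpe.suggest,  # Tree-structured Parzen Estimator
    max_evals=100,
)
print(best)  # something close to {'x': 3.0}

Flair hides this loop behind its param selector classes; in our case each evaluation is a full training run of the text classifier.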

Hyperparameter tuning with Flair

Flair provides a simple API to tune your text classifier parameters. We do, however, need to tell it which hyperparameters to tune and which values it should consider for them.

Running the optimiser is no harder than training the classifier itself, but it requires significantly more time and resources, as it essentially runs training a large number of times. It’s therefore advisable to run it on GPU-accelerated hardware.

We will perform hyperparameter optimisation of a text classifier trained on the Kaggle SMS Spam Collection dataset, learning to differentiate between spam and non-spam messages.

Getting ready

To prepare the dataset, please refer to the “Preprocessing — Building the Dataset” section of State of the Art Text Classification, where we obtain train.csv, test.csv and dev.csv. Make sure the datasets are stored in the same directory as the script running Flair.
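To quickly verify that everything is in place, you can run a small check like the one below (this helper loop is just an illustration, not part of Flair):

from pathlib import Path

# All three splits must sit in the working directory of the optimisation script
for split in ('train.csv', 'dev.csv', 'test.csv'):
    assert Path(split).is_file(), f'{split} not found in the working directory'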

You can check whether you have a GPU available for training by running:

import torch
torch.cuda.is_available()

This returns a boolean indicating whether CUDA is available to PyTorch, the framework Flair is built on.
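For a slightly more informative check, the snippet below (plain PyTorch, nothing Flair-specific) also reports which device training would run on:

import torch

if torch.cuda.is_available():
    # Report the name of the first visible GPU
    print(f'GPU available: {torch.cuda.get_device_name(0)}')
else:
    print('No GPU found; optimisation will fall back to the much slower CPU')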

Tuning the parameters

The first step of hyperparameter optimisation is defining the search space. This means listing all the hyperparameters we want to tune and deciding, for each of them, whether the optimiser should consider only a set of discrete values or search within a bounded continuous range.

For discrete parameters use:

search_space.add(Parameter.PARAMNAME, hp.choice, options=[1, 2, ..])

And for uniform continuous parameters use:

search_space.add(Parameter.PARAMNAME, hp.uniform, low=0.0, high=0.5)
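For example, the two calls below (taken from the search space we define later in this post) declare one discrete and one continuous parameter:

from hyperopt import hp
from flair.hyperparameter.param_selection import SearchSpace, Parameter

search_space = SearchSpace()
# Discrete: the optimiser picks one of the listed batch sizes
search_space.add(Parameter.MINI_BATCH_SIZE, hp.choice, options=[16, 32, 64])
# Continuous: the optimiser samples the dropout uniformly from [0.0, 0.5]
search_space.add(Parameter.DROPOUT, hp.uniform, low=0.0, high=0.5)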

A list of all possible parameters can be found in Flair’s documentation.

Next, you will need to specify some parameters describing the type of text classifier we want to use and how many training_runs and epochs to execute.

param_selector = TextClassifierParamSelector(
    corpus=corpus,
    multi_label=False,
    base_path='resources/results',
    document_embedding_type='lstm',
    max_epochs=10,
    training_runs=1,
    optimization_value=OptimizationValue.DEV_SCORE
)

Note that DEV_SCORE is set as our optimisation value. This is extremely important because we don’t want to optimise our hyperparameters against the test set: doing so would amount to overfitting our hyperparameter choices to it.

Finally, we run param_selector.optimize(search_space, max_evals=100), which executes 100 evaluations of the optimiser and saves the results to resources/results/param_selection.txt.

Full source code to run the whole process is as follows:

from pathlib import Path

from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings
from flair.hyperparameter.param_selection import (
    TextClassifierParamSelector,
    OptimizationValue,
    SearchSpace,
    Parameter,
)
from hyperopt import hp

# Load the train/dev/test splits prepared in part 1
corpus = NLPTaskDataFetcher.load_classification_corpus(
    Path('./'),
    test_file='test.csv',
    dev_file='dev.csv',
    train_file='train.csv',
)

# A single embedding candidate: a stack of GloVe and forward/backward Flair embeddings
word_embeddings = [[WordEmbeddings('glove'),
                    FlairEmbeddings('news-forward'),
                    FlairEmbeddings('news-backward')]]

search_space = SearchSpace()
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=word_embeddings)
search_space.add(Parameter.HIDDEN_SIZE, hp.choice, options=[32, 64, 128, 256, 512])
search_space.add(Parameter.RNN_LAYERS, hp.choice, options=[1, 2])
search_space.add(Parameter.DROPOUT, hp.uniform, low=0.0, high=0.5)
search_space.add(Parameter.LEARNING_RATE, hp.choice, options=[0.05, 0.1, 0.15, 0.2])
search_space.add(Parameter.MINI_BATCH_SIZE, hp.choice, options=[16, 32, 64])

param_selector = TextClassifierParamSelector(
    corpus=corpus,
    multi_label=False,
    base_path='resources/results',
    document_embedding_type='lstm',
    max_epochs=10,
    training_runs=1,
    optimization_value=OptimizationValue.DEV_SCORE,
)

param_selector.optimize(search_space, max_evals=100)

Our search space includes the learning rate, the document embedding hidden size, the number of document embedding RNN layers, the dropout value and the batch size. Note that despite using only one type of word embedding (a stack of news-forward, news-backward and GloVe embeddings), we still had to pass it to the search space, as it is a required parameter.
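Had we wanted the optimiser to also choose between embeddings, we could have listed several stacks as options. A hypothetical variation (the single-GloVe option below is an illustration, not something we ran):

word_embeddings = [
    # Option 0: GloVe alone
    [WordEmbeddings('glove')],
    # Option 1: the stack used in this post
    [WordEmbeddings('glove'),
     FlairEmbeddings('news-forward'),
     FlairEmbeddings('news-backward')],
]
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=word_embeddings)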

Results

The optimiser ran for about 6 hours on a GPU and executed 100 evaluations. The final results were written to resources/results/param_selection.txt.

The last few lines display the best parameter combination as shown below:

--------evaluation run 97
dropout: 0.19686569599930906
embeddings: ./glove.gensim, ./english-forward-v0.2rc.pt, lm-news-english-backward-v0.2rc.pt
hidden_size: 256
learning_rate: 0.05
mini_batch_size: 32
rnn_layers: 2
score: 0.009033333333333374
variance: 8.888888888888905e-07
test_score: 0.9923
...
----------best parameter combination
dropout: 0.19686569599930906
embeddings: 0
hidden_size: 3
learning_rate: 0 <- *this means 0th option*
mini_batch_size: 1
rnn_layers: 1

Based on the test_score from the tuning results, confirmed by a few further evaluations, we achieved a test f1-score of 0.9923 (99.23%)!

This means we outperformed Google’s AutoML by a tiny margin.

[Image: Results obtained on Google AutoML Natural Language]

Hint: if precision = recall, then f1-score = precision = recall.
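The hint follows directly from the definition of the f1-score as the harmonic mean of precision and recall; a quick check with the score above plugged in:

# f1 = 2PR / (P + R); with P == R == x this collapses to 2x²/2x = x
p = r = 0.9923
f1 = 2 * p * r / (p + r)
print(f1)  # 0.9923 (up to floating point rounding)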

Does this mean I will always be able to achieve state-of-the-art results following this guide?

Short answer: no. This guide should give you a good idea of how to use Flair’s hyperparameter optimiser; it is not a comprehensive comparison of NLP text classification frameworks. The approaches described will certainly yield results comparable to other state-of-the-art frameworks, but they will vary depending on the dataset, the preprocessing methods used and the hyperparameter search space defined.

Note that when choosing the best parameter combination, Flair takes into account both the loss and the variance of the results obtained. Therefore, the model with the lowest loss and highest f1-score will not necessarily be selected as the best one.

So how do I use the params to train an actual model now?

To use the best-performing parameters on an actual model, you need to read the optimal parameters from param_selection.txt and manually copy them, one by one, into the code that trains the model, just like we did in part 1.
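As a sketch of what that training script might look like, here is one possible version assuming the same Flair release and API as part 1 (the class names DocumentLSTMEmbeddings, TextClassifier and ModelTrainer and their keyword arguments may differ in other Flair versions):

from pathlib import Path

from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = NLPTaskDataFetcher.load_classification_corpus(
    Path('./'), test_file='test.csv', dev_file='dev.csv', train_file='train.csv')

# Values hand-copied from the best run in param_selection.txt
word_embeddings = [WordEmbeddings('glove'),
                   FlairEmbeddings('news-forward'),
                   FlairEmbeddings('news-backward')]
document_embeddings = DocumentLSTMEmbeddings(
    word_embeddings,
    hidden_size=256,
    rnn_layers=2,
    dropout=0.19686569599930906,
)

classifier = TextClassifier(
    document_embeddings,
    label_dictionary=corpus.make_label_dictionary(),
    multi_label=False,
)

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/classifier',
              learning_rate=0.05,
              mini_batch_size=32,
              max_epochs=10)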

While we are extremely happy with the library, it would be much nicer to have the optimal parameters available in a more code-friendly format or, even better, to have an option to simply export the optimal model during optimisation.