Development of NLP-related solutions is gaining momentum.

Thanks to the family of pre-trained models built on the Transformer architecture, such as BERT and GPT, NLP tasks, and Grammatical Error Correction (GEC) in particular, can now be solved far more efficiently.

In this story I’d like to describe the WebSpellChecker approach to GEC. We developed it while participating in the Shared Task: Grammatical Error Correction, held within the framework of the Building Educational Applications (BEA) 2019 Workshop. I attended the workshop, which took place in Florence, Italy, on August 2, 2019, as WebSpellChecker’s CEO and a co-author of the paper. There, I presented our model and our shared task results in a poster session to conference participants from all over the world.

Before the conference, our Deep Learning Engineer, BDidenko, carried out extensive research and developed a model. For the BEA shared task, we also wrote and published an accompanying paper. Below is a summary of our model, its results, and our future milestones.

Background

The shared task was aimed at creating innovative approaches to automatic correction of all types of errors in written text.

Participants had access to datasets representing various levels and domains of the English language.

The end goal of the competition was to transform incorrect sentences given as input into their correct equivalents as output.

To create a unique system for solving GEC tasks, we took several steps detailed below.

Data Preprocessing

Datasets in the M2 format (a standard format for annotated GEC data; a sample entry is shown after the list below) included:

The First Certificate in English (FCE);

Lang-8 Corpus of Learner English (Lang-8);

The National University of Singapore Corpus of Learner English (NUCLE);

English Write & Improve (W&I) and LOCNESS corpus (W&I+LOCNESSv2.1).
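
To give a sense of the format, here is a hypothetical M2 entry (my own example, not a line from these corpora). The S line holds the tokenized source sentence; each A line describes one edit: a token span, an ERRANT error type, the suggested correction, and annotator metadata, with fields separated by |||:

```
S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
```

Here the single annotation replaces the token at position 1 (“are”) with “is” to fix a subject-verb agreement error.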

Despite being updated and improved, the datasets had several issues:

lots of irrelevant (noisy) data;

the complex structure of the information encoded in the M2 format.

To address these issues, we started with data preprocessing.
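
As an illustration of what this step can involve, below is a minimal Python sketch that converts M2 annotations into parallel source/corrected sentence pairs, a common way to prepare GEC training data. The function name and its simplifications (a single annotator, no noise filtering) are my own illustration rather than the exact pipeline from our paper.

```python
def m2_to_parallel(m2_path, annotator_id=0):
    """Convert an M2 file into (source, corrected) sentence pairs
    by applying the edits of a single annotator.

    A simplified sketch: a real pipeline would also filter out
    noisy entries rather than keep everything.
    """
    pairs = []
    with open(m2_path, encoding="utf-8") as f:
        # M2 entries are separated by blank lines.
        for entry in f.read().strip().split("\n\n"):
            lines = entry.split("\n")
            tokens = lines[0][2:].split()  # drop the leading "S "
            edits = []
            for line in lines[1:]:
                if not line.startswith("A "):
                    continue
                span, err_type, correction, *_, ann = line[2:].split("|||")
                if int(ann) != annotator_id or err_type == "noop":
                    continue
                start, end = map(int, span.split())
                # An empty or "-NONE-" correction means deletion.
                repl = [] if correction in ("", "-NONE-") else correction.split()
                edits.append((start, end, repl))
            # Apply edits right to left so earlier offsets stay valid.
            corrected = list(tokens)
            for start, end, repl in sorted(edits, reverse=True):
                corrected[start:end] = repl
            pairs.append((" ".join(tokens), " ".join(corrected)))
    return pairs
```

Running something like this over the corpora above yields the kind of parallel sentence pairs a sequence-to-sequence GEC model can train on.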