Microsoft recently reported updated results for its Multi-Task Deep Neural Network (MT-DNN) ensemble model. The significant performance boost has the model sitting comfortably atop the GLUE benchmark rankings, outperforming human baselines for the first time in terms of macro-average score (87.6 vs. 87.1).

GLUE (General Language Understanding Evaluation) is a multi-task benchmark and analysis platform for natural language understanding. MT-DNN was improved in early April and climbed to second place on the GLUE leaderboard, trailing only human baselines. Two months later the MT-DNN-ensemble is back with a huge performance improvement on the WNLI task, scoring 20 points higher than most previous methods as shown in the following table.

GLUE leaderboard
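The leaderboard's headline number is a macro-average: each task contributes equally, regardless of how many test examples it has. The short Python sketch below illustrates the arithmetic and contrasts it with a size-weighted average. The task names, scores, and test-set sizes are hypothetical placeholders chosen for easy arithmetic, not real leaderboard figures, and the function names are illustrative only.

```python
from statistics import mean


def macro_average(task_scores):
    """Unweighted mean of per-task scores: every task counts equally."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return mean(task_scores.values())


def size_weighted_average(task_scores, task_sizes):
    """Mean weighted by test-set size, shown only for contrast."""
    total = sum(task_sizes[name] for name in task_scores)
    return sum(task_scores[name] * task_sizes[name] for name in task_scores) / total


# Hypothetical placeholder values, NOT real GLUE results.
scores = {"task_a": 90.0, "task_b": 70.0, "task_c": 80.0}
sizes = {"task_a": 1000, "task_b": 100, "task_c": 100}

print(macro_average(scores))                 # 80.0
print(size_weighted_average(scores, sizes))  # 87.5
```

Under a macro-average, a large gain on one small task (such as WNLI) moves the overall score as much as the same gain on a much larger task would.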

WNLI is a reading comprehension task that uses sentences containing a pronoun and a list of possible referents to test a model’s ability to solve pronoun disambiguation problems. It is one of the harder tasks ranked on GLUE. Even educated people can struggle with particular items in the WNLI test, so the high accuracy of the MT-DNN-ensemble model came as a pleasant surprise. Microsoft researchers have not revealed many details about how the ensemble combines multiple MT-DNN models. The underlying model is described in the paper Multi-Task Deep Neural Networks for Natural Language Understanding.

A fundamental part of natural language understanding is language embedding learning: the process of mapping symbolic natural language text to semantic vector representations. This is what Multi-Task Deep Neural Network models attempt to do, namely learn universal language embeddings.

MT-DNN-ensemble is a multi-task learning method, which means all tasks share the same underlying network structure, although each task has its own objective function. It also combines multi-task learning with language model pre-training. The MT-DNN model’s architecture is shown below.

Architecture of the MT-DNN model for representation learning

The lower layers are shared across all tasks. The input X is first represented as a sequence of embedding vectors, one for each token, in l_1; the transformer-based encoder then generates shared contextual embedding vectors in l_2. Finally, the top layers are task-specific and thus suited to learning features that best fit a particular task. Like the BERT model, MT-DNN is trained in two phases: pre-training and fine-tuning. Unlike BERT, MT-DNN uses multi-task learning during the fine-tuning phase and has multiple task-specific layers in its model architecture. The experimental results are shown in the table below.
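The split between shared lower layers and task-specific top layers can be illustrated with a small, dependency-free Python sketch. This is a toy under stated assumptions, not Microsoft's implementation: the class names (`SharedEncoder`, `TaskHead`, `MultiTaskModel`), the dimensions, and the random weights are all hypothetical, and the Transformer encoder in l_2 is replaced by a simple mean-pool plus one linear layer so the example stays short. The two task names are used only as labels for a two-class head and a three-class head.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible


def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SharedEncoder:
    """Toy stand-in for the shared lower layers (l_1 and l_2)."""

    def __init__(self, vocab_size, hidden):
        self.embeddings = random_matrix(vocab_size, hidden)  # l_1: one vector per token id
        self.mix = random_matrix(hidden, hidden)  # assumption: replaces the Transformer encoder

    def encode(self, token_ids):
        vectors = [self.embeddings[t] for t in token_ids]
        pooled = [sum(col) / len(vectors) for col in zip(*vectors)]  # mean-pool over tokens
        return [math.tanh(v) for v in matvec(self.mix, pooled)]


class TaskHead:
    """Task-specific top layer: its own weights and its own output size."""

    def __init__(self, hidden, num_classes):
        self.weights = random_matrix(num_classes, hidden)

    def predict(self, features):
        return softmax(matvec(self.weights, features))


class MultiTaskModel:
    def __init__(self, vocab_size, hidden, task_classes):
        self.encoder = SharedEncoder(vocab_size, hidden)  # one encoder shared by every task
        self.heads = {name: TaskHead(hidden, n) for name, n in task_classes.items()}

    def forward(self, task, token_ids):
        return self.heads[task].predict(self.encoder.encode(token_ids))


# Hypothetical configuration: a two-class head and a three-class head.
model = MultiTaskModel(vocab_size=50, hidden=8, task_classes={"wnli": 2, "mnli": 3})
tokens = [3, 17, 42]  # made-up token ids
probs_wnli = model.forward("wnli", tokens)
probs_mnli = model.forward("mnli", tokens)
print(len(probs_wnli), len(probs_mnli))  # 2 3
```

Both calls pass through the same `SharedEncoder` instance and differ only in the head that consumes its output. In a multi-task fine-tuning setup of this shape, gradients from every task would update the shared encoder, while each head is updated only by its own task.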

GLUE test set results

After its April boost, the MT-DNN model achieved results similar to the BERT model's on WNLI, with an accuracy of 65.1 percent. The new improvement to 89 percent accuracy has many in the ML community waiting anxiously for Microsoft to reveal the secret of its success.

The associated paper Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding is on arXiv.