The above two papers came before BERT and didn’t use transformer-based architectures.

BERT — The original paper is here, and there is also a very good tutorial with illustrations by Jay Alammar here. The pre-trained weights can be downloaded from the official GitHub repo here. BERT is also available as a TensorFlow Hub module. Various other libraries also make it easy to fine-tune the pre-trained embeddings; they are mentioned later in this post. The timeline below shows some of the major papers that followed BERT in 2019. Google even started using BERT in production to improve search results.

Timeline for projects after BERT

Transformer-XL — Transformer-XL, released in January 2019, improves on the Transformer with an architecture that allows learning dependencies beyond a fixed-length context.
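The core idea — caching the previous segment's hidden states and letting the current segment attend over them — can be sketched as follows. This is a toy single-head example in NumPy under simplifying assumptions: no learned projections, no relative position encodings (which Transformer-XL also introduces), and a memory of only one previous segment.

```python
import numpy as np

def attend_with_memory(h_seg, mem, d):
    """Simplified self-attention where keys/values also cover the cached
    memory from the previous segment; queries come only from the current one."""
    kv = np.concatenate([mem, h_seg], axis=0)          # (mem_len + seg_len, d)
    scores = h_seg @ kv.T / np.sqrt(d)                 # (seg_len, mem_len + seg_len)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ kv                                # (seg_len, d)

rng = np.random.default_rng(0)
seg_len, d = 4, 16
mem = np.zeros((0, d))                                 # first segment has no memory
for _ in range(3):                                     # three consecutive segments
    h_seg = rng.normal(size=(seg_len, d))
    out = attend_with_memory(h_seg, mem, d)
    mem = h_seg                                        # cache for the next segment
```

Because each segment's memory was itself computed with memory from the segment before it, the effective context can grow well beyond a single segment length.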

Comparison of Transformer and Evolved Transformer — Source

Evolved Transformer — Around the same time as Transformer-XL, the Evolved Transformer was released, which is a Transformer architecture developed by conducting an evolution-based neural architecture search (NAS).

GPT-2 — After BERT, I think the project that got the most news coverage was GPT-2 from OpenAI, due to its ability to generate almost human-like sentences, and also due to OpenAI's initial decision not to release the largest model over concerns that it could be used to create fake news, etc. They released the largest model almost 10 months later. You can play with the model at https://talktotransformer.com/ and at https://transformer.huggingface.co/ by Huggingface. I think if it had a different name, maybe even a Sesame Street character, it would have been even more popular :)

ERNIE and ERNIE 2 — Currently, ERNIE 2.0 holds the #1 position on the GLUE leaderboard. Github repo.

XLNet

RoBERTa — This paper from FAIR measures the impact of various BERT hyper-parameters and shows that the original BERT models were under-trained: with more training/tuning, they can outperform the initial results. Currently, the results from RoBERTa are at #8 on the GLUE leaderboard!

Salesforce CTRL — The CTRL model has 1.6 billion parameters and provides ways to control the generation of artificial text.

ALBERT — This paper describes parameter reduction techniques to lower the memory consumption and increase the training speed of BERT models. The ALBERT repo has the pre-trained weights. The ALBERT base model has 12M parameters, whereas the BERT base model has 110M!
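The two main reduction techniques — factorizing the embedding matrix (V×H becomes V×E + E×H) and sharing one set of Transformer-layer weights across all layers — account for most of that gap. A rough back-of-the-envelope estimate, counting weight matrices only (biases and LayerNorm parameters are ignored, so totals land slightly below the official 110M/12M figures):

```python
# Base-config sizes: vocab, hidden size, layers, and ALBERT's embedding size
V, H, L, E = 30522, 768, 12, 128

def transformer_layer(h):
    attention = 4 * h * h        # Q, K, V and output projections
    ffn = 2 * h * (4 * h)        # feed-forward: h -> 4h -> h
    return attention + ffn

# BERT base: full-size token/position/segment embeddings, 12 independent layers
bert = V * H + 512 * H + 2 * H + L * transformer_layer(H)

# ALBERT base: E-dim embeddings projected up to H, one layer shared 12 times
albert = V * E + 512 * E + 2 * E + E * H + transformer_layer(H)

print(f"BERT base   ~{bert / 1e6:.0f}M parameters")    # ~109M
print(f"ALBERT base ~{albert / 1e6:.0f}M parameters")  # ~11M
```

Note that the embedding factorization saves about 20M parameters and the cross-layer sharing about 78M, so sharing is by far the bigger contributor.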

Big Bird