This is a post for the EMNLP 2019 paper "The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives".

We look at the evolution of representations of individual tokens in Transformers trained with different training objectives (MT, LM, and MLM, i.e., BERT-style) from the Information Bottleneck perspective and show that:

LMs gradually forget the past when forming predictions about the future;

for MLMs, the evolution proceeds in two stages: context encoding and token reconstruction;

and MT representations get refined with context, but less processing happens.