The Universal Transformer repeatedly refines a series of vector representations (shown as h_1 to h_m) for each position of the sequence in parallel, by combining information from different positions using self-attention and applying a recurrent transition function. Arrows denote dependencies between operations.
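The refinement loop the caption describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes unparameterized single-head dot-product attention and a one-layer ReLU transition (the actual model uses multi-head attention, residual connections, layer normalization, and a feed-forward or convolutional transition), but the overall structure — self-attention over all positions followed by a shared transition function, applied repeatedly — is the same.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    # Simplified dot-product self-attention: each position attends
    # to all m positions in parallel (no learned projections here).
    scores = h @ h.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ h

def transition(h, W, b):
    # Position-wise transition function (one ReLU layer); the same
    # weights are shared across positions and refinement steps.
    return np.maximum(0.0, h @ W + b)

def universal_transformer_step_loop(h, W, b, steps=3):
    # Repeatedly refine all m position representations h_1..h_m.
    for _ in range(steps):
        h = transition(self_attention(h), W, b)
    return h

rng = np.random.default_rng(0)
m, d = 4, 8                          # m positions, d-dimensional states
h0 = rng.normal(size=(m, d))         # initial h_1..h_m
W = rng.normal(size=(d, d)) * 0.1
b = np.zeros(d)
out = universal_transformer_step_loop(h0, W, b)
print(out.shape)                     # one refined d-dimensional vector per position
```

Because the transition weights are reused at every step, the number of refinement steps can be varied at inference time without changing the parameter count.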