The proposed model, called DialogueRNN and illustrated in the figure above, determines the emotion of each utterance through the following three factors:

Party state — models the parties’ emotion dynamics through the conversation. The basic idea behind the party state is to ensure that the model is aware of the speaker of each utterance in the conversation.

Global state — models the context of an utterance in the dialogue, given by jointly encoding the preceding utterances and the party state. Note that an attention mechanism is applied over the global states to provide an improved context representation. This state basically serves as the speaker-specific utterance representation.

Emotion representation — inferred from the party state and the preceding speakers’ states as context (the global state). This representation is used to perform the final emotion classification via a softmax layer.

Each component of the architecture is modeled by a gated recurrent unit (GRU). It’s important to note that during training, the speaker state is updated using the current utterance along with its context, which is obtained by applying an attention mechanism over the preceding global states. The attention mechanism assigns higher attention scores to the utterances that are emotionally relevant to the current utterance.
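As a rough illustration, here is what such an attention step could look like in PyTorch. This is a minimal sketch, assuming a learned bilinear scoring function in the spirit of the paper; the class name GlobalStateAttention and the dimensions d_u (utterance features) and d_g (global state) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalStateAttention(nn.Module):
    """Attention over the preceding global states g_1..g_{t-1}.

    Minimal sketch: scores come from a learned bilinear form between
    the current utterance u_t and each preceding global state, then a
    softmax normalizes them. d_u and d_g are hypothetical sizes.
    """

    def __init__(self, d_u: int, d_g: int):
        super().__init__()
        self.W_alpha = nn.Linear(d_g, d_u, bias=False)  # scoring matrix

    def forward(self, u_t, g_hist):
        # u_t:    (batch, d_u)        features of the current utterance
        # g_hist: (t-1, batch, d_g)   preceding global states
        scores = torch.einsum('bu,tbu->tb', u_t, self.W_alpha(g_hist))
        alpha = F.softmax(scores, dim=0)  # higher weight on emotionally
                                          # relevant preceding utterances
        c_t = torch.einsum('tb,tbg->bg', alpha, g_hist)  # context vector
        return c_t, alpha
```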

Overall, the speaker update encodes, via the Party GRU (shown in blue), the information on the current utterance along with its context from the Global GRU (shown in green). All of this information feeds the final emotion classification, which is performed by the Emotion GRU (shown in maroon). Note that the current emotion classification also relies on the previous emotion-relevant information.
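Putting the pieces together, the sketch below wires the three GRU cells for a single dialogue turn, reusing the GlobalStateAttention sketch from above. It is an illustrative reading of the data flow described in this section, not the authors' implementation; all names and dimensions (d_u, d_g, d_p, d_e) are assumptions, and details such as listener updates are omitted.

```python
import torch
import torch.nn as nn

class DialogueRNNStep(nn.Module):
    """One dialogue turn, wiring the three GRU cells together (sketch)."""

    def __init__(self, d_u, d_g, d_p, d_e, n_classes):
        super().__init__()
        self.global_gru = nn.GRUCell(d_u + d_p, d_g)   # green in the figure
        self.party_gru = nn.GRUCell(d_u + d_g, d_p)    # blue in the figure
        self.emotion_gru = nn.GRUCell(d_p, d_e)        # maroon in the figure
        self.attention = GlobalStateAttention(d_u, d_g)
        self.classifier = nn.Linear(d_e, n_classes)    # softmax layer

    def forward(self, u_t, q_speaker, g_prev, g_hist, e_prev):
        # Global state: jointly encode the utterance and the speaker state.
        g_t = self.global_gru(torch.cat([u_t, q_speaker], dim=-1), g_prev)
        # Context: attention over the preceding global states, if any.
        if g_hist is not None and len(g_hist) > 0:
            c_t, _ = self.attention(u_t, g_hist)
        else:
            c_t = torch.zeros_like(g_t)
        # Party (speaker) state: current utterance plus attended context.
        q_t = self.party_gru(torch.cat([u_t, c_t], dim=-1), q_speaker)
        # Emotion representation: the speaker state combined with the
        # previous emotion-relevant information carried by e_prev.
        e_t = self.emotion_gru(q_t, e_prev)
        logits = self.classifier(e_t)  # softmax is applied in the loss
        return g_t, q_t, e_t, logits
```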

Variants

Several variants of the DialogueRNN model are proposed and compared in this study:

DialogueRNN_l — considers an extra listener state (defined at the end of this post) while a speaker utters.

BiDialogueRNN — uses a bidirectional RNN architecture instead.

DialogueRNN+Att — applies attention over all surrounding emotion representations (see the sketch after this list).

BiDialogueRNN+Att — similar to the previous model but uses a bidirectional RNN instead.
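As a rough sketch of the +Att variants, the function below attends over the emotion representations of all surrounding utterances. Plain dot-product scoring is an assumption made here for brevity (the paper learns a scoring matrix), and for the Bi variants E would come from a bidirectional pass.

```python
import torch

def emotion_attention(E: torch.Tensor) -> torch.Tensor:
    # E: (seq_len, batch, d_e) emotion representations of all utterances.
    # Pairwise dot-product scores between every pair of utterances.
    scores = torch.einsum('tbd,sbd->tsb', E, E)
    alpha = torch.softmax(scores, dim=1)             # weights over all s
    # Attended emotion representation for every utterance t.
    return torch.einsum('tsb,sbd->tbd', alpha, E)
```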

Other baselines are also evaluated; you can refer to the paper for details.

Results

Two datasets are used for all experiments: IEMOCAP and AVEC. Both datasets contain dyadic (two-party) conversations.

From the table below, we can observe that DialogueRNN (highlighted in green) outperforms all baselines, including the previous state-of-the-art model (CMN), on both datasets. Note that these results use only the text modality.