Opinion Mining with Deep Recurrent Neural Networks — https://www.cs.cornell.edu/~oirsoy/files/emnlp14drnt.pdf

Authors — Ozan Irsoy and Claire Cardie

This paper advocates the use of RNNs for the task of opinion expression extraction. It aims to label each word in a sentence as part of a DSE, part of an ESE, or neither. Wait! this is escalating.

Opinion analysis can be thought of as detecting the sentiment of text (negative/positive/neutral), the kind of expression in text (hate, love, ...), and a myriad of other tasks related to detecting opinion and emotion in text. In this paper, the task is only to find DSEs and ESEs.

DSE stands for direct subjective expression and ESE for expressive subjective expression. To put it simply, a subjective phrase can convey an opinion either directly (explicitly stating it) or expressively (implying it through word choice and tone).

DSEs consist of explicit mentions of private states or speech events expressing private states; and ESEs consist of expressions that indicate sentiment, emotion, etc., without explicitly conveying them.

Still a little confusing, so what is a private state:

As Quirk et al. (1985) define it, a private state is a state that is not open to objective observation or verification:

“a person may be observed to assert that God exists, but not to believe that God exists. Belief is in this sense ‘private’.”

Let's take the example from the paper:

The committee, as usual, has refused to make any statements.

B_ESE = Beginning token of an ESE

I_ESE = Token inside an ESE

B_DSE = Beginning token of a DSE

I_DSE = Token inside a DSE

O = Token outside any DSE/ESE

(This is BIO tagging scheme)

has refused to make any statements → explicitly expresses the opinion holder's attitude, so it is a DSE.

as usual → indirectly expresses the attitude of the writer, so it is an ESE.

(Thoroughly discussed in section 2.2 of Wiebe et al.)

So in the data, every word is labeled with one of the BIO tags.

Methodology

Use RNNs. Why? Because they have memory.

Fun fact: RNNs date back to 1990 (Elman).

In layman's terms, an RNN is a neural network architecture that maintains a hidden state of its own. When a new input comes in (for intuition, read it as "the next word of the sentence"), the network remembers what the previous words of the sentence were conveying. It then updates the hidden state (based on the current input and the previous hidden state) and produces an output from it. RNNs are trained using backpropagation (through time).
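The update just described can be written in a few lines of numpy. This is a minimal sketch of an Elman-style recurrence with toy dimensions (not the paper's sizes or weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman-RNN update: new hidden state from current input and previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy dimensions, illustrative only.
d_in, d_h = 4, 3
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                       # initial hidden state
for x_t in rng.normal(size=(5, d_in)):  # a "sentence" of 5 word vectors
    h = elman_step(x_t, h, W_xh, W_hh, b_h)  # memory carried forward step by step
```

Note how `h` is the only thing carried between time steps: it is the network's entire memory of the sentence so far.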

This task particularly favours the use of RNNs because words are labeled DSE/ESE based on context, and context is exactly what the RNN's memory captures.

Use bidirectionality. Why? Because you may need to know the next words to predict the label for the current word. Let us take an example:

1. I did not accept his suggestion.

2. I did not go to the rodeo.

After passing through "I did", both sentences leave the model in the same hidden state, and even a human cannot say whether there is a DSE/ESE without looking at the whole sentence. In fact the first has a DSE phrase ("did not accept") and the second does not. So this case cannot be handled by an RNN that reads input only in the forward direction (left to right), and we use a bidirectional RNN instead, which maintains two hidden states: one for the forward direction and one for the backward direction.
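A bidirectional RNN can be sketched as two independent unidirectional RNNs whose per-token hidden states are concatenated. This is an illustrative numpy sketch with made-up toy dimensions, not the paper's configuration:

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence, returning the hidden state at each step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

def birnn(xs, fwd_params, bwd_params):
    """Concatenate forward-in-time and backward-in-time hidden states per token."""
    h_fwd = run_rnn(xs, *fwd_params)
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]  # reverse back into sentence order
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
make_params = lambda: (rng.normal(scale=0.1, size=(d_h, d_in)),
                       rng.normal(scale=0.1, size=(d_h, d_h)),
                       np.zeros(d_h))

xs = rng.normal(size=(6, d_in))        # a "sentence" of 6 word vectors
H = birnn(xs, make_params(), make_params())  # shape: (6 tokens, 2 * d_h)
```

Each token's representation now summarizes both its left context (forward state) and its right context (backward state), which is what disambiguates the "I did ..." examples above.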

Stack RNNs. Why? For more representative power.

How to stack? Just use previous layer’s hidden state as input.

What to output? Either the final hidden states of all layers, or the final hidden state of just the last layer. The authors use the second approach because:

connecting the output layer to only the last hidden layer forces the architecture to capture enough high-level information at the final layer for producing the appropriate output layer decision.
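Stacking plus the "output from the last layer only" choice can be sketched like this (unidirectional for brevity, toy dimensions of my own choosing; the paper stacks bidirectional layers):

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Run a tanh RNN over a sequence, returning the hidden state at each step."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(2)
d_in, d_h, n_layers, n_tags = 4, 3, 3, 5  # 5 tags: O, B_DSE, I_DSE, B_ESE, I_ESE

layers = []
for i in range(n_layers):
    in_dim = d_in if i == 0 else d_h   # layer 1 reads word vectors; deeper layers
    layers.append((rng.normal(scale=0.1, size=(d_h, in_dim)),   # read hidden states
                   rng.normal(scale=0.1, size=(d_h, d_h)),
                   np.zeros(d_h)))

xs = rng.normal(size=(6, d_in))        # a "sentence" of 6 word vectors
h = xs
for params in layers:
    h = run_rnn(h, *params)            # previous layer's hidden states are the input

W_out = rng.normal(scale=0.1, size=(n_tags, d_h))
logits = h @ W_out.T                   # output layer reads only the last hidden layer
```

One logit vector per token, over the five BIO classes, then feeds the softmax in the training objective below.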

Training

Objective function: standard multiclass cross-entropy

Optimizer: SGD with a fixed learning rate (0.005) and a fixed momentum rate (0.7)

Weight updates after minibatches of 80 sentences

Dropout for the larger networks only
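For concreteness, here is what the multiclass cross-entropy objective looks like over per-token logits; the five classes and numbers are illustrative, not the paper's data:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean multiclass cross-entropy over all tokens."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# 4 tokens, 5 BIO classes (O, B_DSE, I_DSE, B_ESE, I_ESE); toy logits.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1, 0.1],
                   [2.0, 0.1, 0.1, 0.1, 0.1]])
labels = np.array([0, 1, 2, 0])  # the correct class has the largest logit each time
loss = cross_entropy(logits, labels)
```

SGD with momentum then pushes this loss down by nudging the weights along (a running average of) its gradient.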

Precision, recall and F1 measure are used for performance evaluation.

Performance Metric:

Binary Overlap: every overlapping match between a predicted and true expression is taken as correct.

Proportional Overlap: imparts a partial correctness, proportional to the overlapping amount, to each match.
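Under one common reading of these two criteria (my sketch, with spans as half-open token ranges; the paper's exact scoring script may differ in details), precision under each looks like this:

```python
def overlap(a, b):
    """Length of the overlap between two half-open token spans (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def binary_precision(pred, gold):
    """Binary overlap: any overlapping predicted span counts as fully correct."""
    hits = sum(any(overlap(p, g) > 0 for g in gold) for p in pred)
    return hits / len(pred) if pred else 0.0

def proportional_precision(pred, gold):
    """Proportional overlap: each predicted span earns credit for its covered fraction."""
    total = 0.0
    for p in pred:
        covered = max((overlap(p, g) for g in gold), default=0)
        total += covered / (p[1] - p[0])
    return total / len(pred) if pred else 0.0

gold = [(6, 12)]         # e.g. "has refused to make any statements"
pred = [(4, 9), (0, 2)]  # one partial hit, one spurious span
b = binary_precision(pred, gold)        # the partial hit counts fully: 1/2
p = proportional_precision(pred, gold)  # the partial hit earns only 3/5: 0.3
```

Recall is the same computation with the roles of predicted and gold spans swapped, and F1 is their harmonic mean as usual.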

After this? I would like to read more about CRFs (conditional random fields).