Introduction

Entity extraction from text is a major Natural Language Processing (NLP) task. Recent advances in deep learning (DL) have made it practical to apply DL models to NLP tasks, often with a large gain in accuracy over traditional methods.

I tried extracting information from articles using both deep learning and traditional methods. The difference was striking: the DL method reached 85% accuracy, against 65% for the legacy methods.

The aim of the project is to tag each word of an article with one of 4 categories: organisation, person, miscellaneous, and other, and then to find the organisations and names that are most prominent in the article. The deep learning model tags each word with one of the 4 categories above. A rule-based step then filters out unwanted tags and finds the most prominent names and organisations.

The code base is taken from Guillaume Genthial’s repo, and all the credit goes to him for the work on NER. His blog post on sequence tagging, which is the backbone of this project, covers the low-level details.

High level architecture of the model

Architecture

This is the high-level architecture of the model that tags each word with a category. I would like to explain each component of the model to give you a high-level understanding of it. In general, the components are divided into three sections (based on Guillaume Genthial’s blog):

Word Representation: The first thing we can do is load some pre-trained word embeddings (GloVe). We also extract some meaning from the characters.

Contextual Word Representation: For each word in its context, we need to get a meaningful representation using an LSTM.

Decoding: Once we have a vector representing each word, we can use it to make a prediction.

Hot encoding (Words to numbers)

Deep learning models accept only numerical input, not text. To use a deep learning model on data that is not numerical, the input first has to be converted into numerical form. This post calls that process hot encoding; in the code below, each word is mapped to an integer id, and those ids are later used to look up embedding vectors.

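Here is a small example of how it’s done:

from collections import Counter

# `words` is the list of all tokens in the input data
word_counts = Counter(words)
# Most frequent words first, so they get the smallest ids
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}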

Similarly, we have to collect all the characters used in the input data and convert them to numbers for the character embedding.
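To make the mapping concrete, here is a minimal, self-contained sketch on a toy sentence. The corpus, the `encode` and `one_hot` helpers, and the `-1` id for unknown words are assumptions made for illustration; they are not part of the project code. It also shows how an integer id relates to a true one-hot vector.

```python
from collections import Counter

# Toy corpus; in the project the words would come from the training data.
words = "the bank of england said the bank would cut rates".split()

# Most frequent words get the smallest ids.
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
vocab_to_int = {word: idx for idx, word in enumerate(sorted_vocab)}
int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}


def encode(tokens, vocab, unk_id=-1):
    """Map tokens to integer ids; unknown tokens get unk_id (an assumption)."""
    return [vocab.get(token, unk_id) for token in tokens]


def one_hot(idx, size):
    """Turn one integer id into a one-hot list of length `size`."""
    return [1 if position == idx else 0 for position in range(size)]


ids = encode(["the", "bank", "said", "hello"], vocab_to_int)
print(ids)
print(one_hot(vocab_to_int["bank"], len(vocab_to_int)))
```

In practice the integer ids are usually fed straight into an embedding lookup, so the explicit one-hot vector is rarely materialised.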

Word Embedding & Character embedding

A word embedding is a learned representation for text where words that have the same meaning have a similar representation. Word embedding is usually done using neural networks where words or phrases from the vocabulary are mapped to vectors of real numbers.

However, generating word vectors for a dataset can be computationally expensive (see my GitHub repo if you would like to play with it). An easy way around this is to use pretrained word embeddings, such as the GloVe vectors published by researchers at Stanford NLP.
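As a rough illustration of what pretrained vectors look like, the sketch below parses a few lines in the GloVe text layout (a word followed by its space-separated components) and compares words by cosine similarity. The three 3-dimensional vectors are made up for the example, and `load_vectors` and `cosine` are hypothetical helper names; real GloVe files have far more words and dimensions.

```python
import math

# Made-up vectors in the GloVe text layout: word, then its components.
toy_glove_lines = [
    "london 0.9 0.1 0.0",
    "paris 0.8 0.2 0.1",
    "banana 0.0 0.1 0.9",
]


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = load_vectors(toy_glove_lines)
sim_cities = cosine(vectors["london"], vectors["paris"])
sim_fruit = cosine(vectors["london"], vectors["banana"])
print(sim_cities > sim_fruit)  # words with related meaning sit closer together
```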

A character embedding is a vector representation of characters from which word vectors can be derived. Its main use is that many entities don’t have a pretrained word vector at all, so their word vector can be computed from character vectors instead. There is a good online resource available for the details of character embeddings.
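The point about missing word vectors can be seen in a small sketch. The training words, the unseen name "Logan", and the `char_ids` helper are assumptions chosen for the example: the name has no entry in the word vocabulary, yet every one of its characters has an id, so a character-level model still has something to work with.

```python
# Hypothetical training words; the real vocabulary comes from the dataset.
train_words = ["London", "Paris", "Google"]

# One id per distinct character seen in training.
vocab_chars = {ch: idx for idx, ch in enumerate(sorted(set("".join(train_words))))}
# One id per (lowercased) word seen in training.
vocab_words = {word.lower(): idx for idx, word in enumerate(train_words)}


def char_ids(word, vocab):
    """Ids of the characters of `word`, skipping characters never seen."""
    return [vocab[ch] for ch in word if ch in vocab]


unseen = "Logan"
in_word_vocab = unseen.lower() in vocab_words  # no pretrained word entry
ids = char_ids(unseen, vocab_chars)            # but every character is known
print(in_word_vocab, len(ids))
```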

LSTM

NN vs RNN

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time series data emanating from sensors, stock markets and government agencies. Because an RNN carries state from one step to the next, it can take the context of each word into account.

RNN neurons

LSTM is a type of recurrent neural network that can retain more contextual information than a simple recurrent neural network. The major difference between a simple RNN and an LSTM lies in the architecture of each neuron.

For each word in its context, we need to get a meaningful representation using LSTM.
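To show where the two kinds of neuron differ, here is a minimal sketch of a single step of each, using scalar states and hand-picked weights (all weight values and the function names `rnn_step` and `lstm_step` are assumptions for illustration). The simple RNN overwrites its state at every step, while the LSTM keeps a separate cell state that the forget and input gates can preserve or update.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(x, h_prev, w_x=0.5, w_h=0.5):
    """Simple RNN: the new state is one squashed mix of input and old state."""
    return math.tanh(w_x * x + w_h * h_prev)


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar state; `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, bias = w[name]
        return squash(w_x * x + w_h * h_prev + bias)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much old cell state to keep
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


weights = {name: (0.5, 0.5, 0.0)
           for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, weights)

# With the forget gate open and the input gate closed, the cell state is
# carried through unchanged, which is how an LSTM retains context.
keep = dict(weights, forget=(0.0, 0.0, 50.0), input=(0.0, 0.0, -50.0))
_, c_kept = lstm_step(1.0, 0.0, 0.7, keep)
print(h, c_kept)
```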

If you want to read more about LSTMs and RNNs in general, the following article is useful:

Conditional Random Field (CRF)

We can use softmax as the final decoding step to predict the tags. Softmax gives, for each word, the probability of that word belonging to each of the classes. But that method makes local choices. In other words, even if we capture some information from the context, the tagging decision is still local: with a softmax activation function, we don’t make use of the neighbouring tagging decisions. For instance, in New York, the fact that we are tagging York as a location should help us to decide that New corresponds to the beginning of a location.
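The local nature of softmax decoding can be sketched with made-up scores for the two words of "New York". The tag names and score values below are assumptions picked for illustration. Each word is decided on its own, so the result can be a sequence that makes little sense as a whole.

```python
import math

TAGS = ["B-LOC", "I-LOC", "O"]  # illustrative tag set


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up per-word scores, one entry per tag in TAGS.
scores = {
    "New": [1.0, 0.2, 1.2],
    "York": [0.3, 2.5, 0.4],
}

probs = {word: softmax(word_scores) for word, word_scores in scores.items()}
# Each word picks its own best tag, ignoring the neighbouring decision.
local_tags = {word: TAGS[p.index(max(p))] for word, p in probs.items()}
print(local_tags)
```

Here "New" is tagged O while "York" is tagged as the inside of a location, a combination that a decoder aware of neighbouring tags could rule out.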

In CRFs, our input data is sequential, and we have to take previous context into account when making predictions on a data point. We use a linear-chain CRF in this project. In a linear-chain CRF, features depend only on the current and previous labels, rather than on arbitrary labels throughout the sentence.

To model this behavior, we will use feature functions, each of which takes several inputs:

a sentence $s$

the position $i$ of a word in the sentence

the label $l_i$ of the current word

the label $l_{i-1}$ of the previous word

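Next, assign each feature function $f_j$ a weight $\lambda_j$. Given a sentence $s$ with $n$ words and $m$ feature functions, we can now score a labeling $l$ of $s$ by adding up the weighted features over all words in the sentence:

$$\mathrm{score}(l \mid s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1})$$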

Example Feature Functions based on POS tagging

$f_1(s, i, l_i, l_{i-1}) = 1$ if $l_i =$ ADVERB and the $i$-th word ends in “-ly”; 0 otherwise. If the weight $\lambda_1$ associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

$f_2(s, i, l_i, l_{i-1}) = 1$ if $i = 1$, $l_i =$ VERB, and the sentence ends in a question mark; 0 otherwise. If the weight $\lambda_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.

$f_3(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ ADJECTIVE and $l_i =$ NOUN; 0 otherwise. A positive weight for this feature means that adjectives tend to be followed by nouns.

$f_4(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ PREPOSITION and $l_i =$ PREPOSITION; 0 otherwise. A negative weight $\lambda_4$ for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.

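Finally, we can transform these scores into probabilities $p(l \mid s)$ between 0 and 1 by exponentiating and normalizing over all possible labelings $l'$:

$$p(l \mid s) = \frac{\exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1})\right]}{\sum_{l'} \exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l'_i, l'_{i-1})\right]}$$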
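The scoring and normalizing steps can be tried out on a tiny example. This is a sketch under stated assumptions: the three-word sentence, the label set, the weights 2.0 and 1.0, and the helper names are all made up, and only two feature functions (the "-ly" adverb rule and the adjective-noun rule) are used. Normalization is done by brute force over every labeling, which is only feasible for very short sentences.

```python
import math
from itertools import product

LABELS = ["ADVERB", "ADJECTIVE", "NOUN"]  # illustrative label set


def f_adverb_ly(s, i, li, li_prev):
    """1 if the current label is ADVERB and the word ends in 'ly'."""
    return 1.0 if li == "ADVERB" and s[i].endswith("ly") else 0.0


def f_adj_noun(s, i, li, li_prev):
    """1 if an ADJECTIVE is followed by a NOUN."""
    return 1.0 if li_prev == "ADJECTIVE" and li == "NOUN" else 0.0


# (weight, feature function) pairs; the weights are made-up values.
FEATURES = [(2.0, f_adverb_ly), (1.0, f_adj_noun)]


def score(s, labeling):
    """Sum of weighted features over all positions of the sentence."""
    total = 0.0
    for i, li in enumerate(labeling):
        li_prev = labeling[i - 1] if i > 0 else None  # no label before word 0
        for weight, feature in FEATURES:
            total += weight * feature(s, i, li, li_prev)
    return total


def probabilities(s):
    """Exponentiate and normalize the scores of every possible labeling."""
    labelings = list(product(LABELS, repeat=len(s)))
    exp_scores = [math.exp(score(s, labeling)) for labeling in labelings]
    z = sum(exp_scores)
    return {labeling: value / z for labeling, value in zip(labelings, exp_scores)}


sentence = ["really", "happy", "dog"]
probs = probabilities(sentence)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```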

To sum up, to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary. Basically, we need to do 2 things (based on Guillaume Genthial’s blog):

Find the sequence of tags with the best score.

Compute a probability distribution over all the sequences of tags.
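For the first of these two tasks, a common choice for a linear-chain model is the Viterbi algorithm. The sketch below assumes the scores are given as a per-word score for each tag (`unary`) plus a score for each pair of consecutive tags (`trans`); the numbers, tag order and function names are made up for illustration and are not taken from the project code. A brute-force search over all sequences is included only to check the result.

```python
from itertools import product


def sequence_score(unary, trans, tags):
    """Score of one tag sequence: per-word scores plus transition scores."""
    total = sum(unary[t][tag] for t, tag in enumerate(tags))
    total += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
    return total


def viterbi(unary, trans):
    """Return the best-scoring tag sequence and its score."""
    n_tags = len(unary[0])
    best = list(unary[0])  # best score of any path ending in each tag
    back = []              # back-pointers, one list per later position
    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags), key=lambda p: best[p] + trans[p][cur])
            pointers.append(prev)
            new_best.append(best[prev] + trans[prev][cur] + unary[t][cur])
        best = new_best
        back.append(pointers)
    last = max(range(n_tags), key=lambda tag: best[tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Tags: 0 = B-LOC, 1 = I-LOC, 2 = O (illustrative). Words: "New", "York".
unary = [[1.0, 0.2, 1.2], [0.3, 2.5, 0.4]]
# Reward B-LOC -> I-LOC, penalise O -> I-LOC; other transitions are neutral.
trans = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, -2.0, 0.0]]

path, path_score = viterbi(unary, trans)
brute_best = max(product(range(3), repeat=len(unary)),
                 key=lambda tags: sequence_score(unary, trans, tags))
print(path, path_score)
```

With the transition scores in play, "New" is now tagged as the beginning of a location, which the purely local decision would have missed.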

Luckily, TensorFlow provides a CRF library (tf.contrib.crf, in TensorFlow 1.x) that makes this easy to implement.

log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    scores, labels, sequence_lengths)

(The above code is taken from Guillaume Genthial’s GitHub repo)

CRF reads:

How model works

For each word, we want to build a vector that will capture the meaning and relevant features for our task. We’re going to build this vector as a concatenation of the word embeddings from GloVe and a vector containing features extracted from the character level. One option is to use some kind of neural network to make this extraction automatically for us. In this post, we’re going to use a bi-LSTM at the character level.

We hot-encode all the words in the CoNLL dataset that have an entry in the GloVe word embeddings. As mentioned, a neural network accepts only numbers, not text, so we have to convert the words to numerical form. The CoNLL dataset contains words and their corresponding tags. After the encoding, both the words and the tags are converted to numerical ids.

Code for reading the words and their tags and encoding them:

def __iter__(self):
    niter = 0  # number of sentences yielded so far
    with open(self.filename) as f:
        words, tags = [], []
        for line in f:
            line = line.strip()
            # A blank line or a -DOCSTART- marker ends the current sentence
            if len(line) == 0 or line.startswith("-DOCSTART-"):
                if len(words) != 0:
                    niter += 1
                    if self.max_iter is not None and niter > self.max_iter:
                        break
                    yield words, tags
                    words, tags = [], []
            else:
                # First column is the word, last column is its tag
                ls = line.split(' ')
                word, tag = ls[0], ls[-1]
                if self.processing_word is not None:
                    word = self.processing_word(word)
                if self.processing_tag is not None:
                    tag = self.processing_tag(tag)
                words += [word]
                tags += [tag]

(The above code is taken from Guillaume Genthial’s GitHub repo)
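To see the reading logic in isolation, here is a standalone sketch that parses a small in-memory sample laid out like a CoNLL file: one token per line with the tag in the last column, a blank line between sentences, and a -DOCSTART- marker between documents. The sample lines and the `read_conll` name are made up for illustration. Unlike the excerpt above, this version also yields the final sentence when the input does not end with a blank line, which is a deliberate difference.

```python
def read_conll(lines):
    """Yield (words, tags) pairs, one pair per sentence."""
    words, tags = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Sentence boundary: emit what has been collected so far.
            if words:
                yield words, tags
                words, tags = [], []
        else:
            columns = line.split(" ")
            words.append(columns[0])   # first column: the word
            tags.append(columns[-1])   # last column: the entity tag
    if words:  # last sentence when there is no trailing blank line
        yield words, tags


# Illustrative sample, not taken from the real dataset files.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = list(read_conll(sample.splitlines()))
for sentence_words, sentence_tags in sentences:
    print(list(zip(sentence_words, sentence_tags)))
```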

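Code for looking up the ids of words, tags, and characters:

# vocab_words, vocab_chars, lowercase, chars, allow_unk, NUM and UNK
# come from the enclosing scope in the repo.
def f(word):
    # 0. get the ids of the characters of the word
    if vocab_chars is not None and chars == True:
        char_ids = []
        for char in word:
            # ignore chars out of vocabulary
            if char in vocab_chars:
                char_ids += [vocab_chars[char]]

    # 1. preprocess the word
    if lowercase:
        word = word.lower()
    if word.isdigit():
        word = NUM

    # 2. get the id of the word
    if vocab_words is not None:
        if word in vocab_words:
            word = vocab_words[word]
        else:
            if allow_unk:
                word = vocab_words[UNK]
            else:
                # unknown word and no UNK fallback: print for debugging
                print(word)
                print(vocab_words)

    # 3. return (char ids, word id) or just the word id
    if vocab_chars is not None and chars == True:
        return char_ids, word
    else:
        return word

(The above code is taken from Guillaume Genthial’s GitHub repo)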

Now, let’s use TensorFlow built-in functions to load the word embeddings. Assume that embeddings is a numpy array with our GloVe embeddings, such that embeddings[i] gives the vector of the i-th word.

L = tf.Variable(embeddings, dtype=tf.float32, trainable=False)
pretrained_embeddings = tf.nn.embedding_lookup(L, word_ids)

(The above code is taken from Guillaume Genthial’s GitHub repo)

Now, we can build the word embeddings from the characters. Here, we don’t have any pretrained character embeddings.

# Trainable character embedding matrix: one row per character
_char_embeddings = tf.get_variable(
    name="_char_embeddings",
    dtype=tf.float32,
    shape=[self.config.nchars, self.config.dim_char])
char_embeddings = tf.nn.embedding_lookup(_char_embeddings,
    self.char_ids_tensor, name="char_embeddings")

# Flatten batch and sentence dimensions so that each word is one sequence
s = tf.shape(char_embeddings)
char_embeddings = tf.reshape(char_embeddings,
    shape=[s[0]*s[1], s[-2], self.config.dim_char])
word_lengths = tf.reshape(self.word_lengths_tensor, shape=[s[0]*s[1]])

# bi-LSTM over the characters of each word
cell_fw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_char,
    state_is_tuple=True)
cell_bw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_char,
    state_is_tuple=True)
_output = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, char_embeddings,
    sequence_length=word_lengths, dtype=tf.float32)

(The above code is taken from Guillaume Genthial’s GitHub repo)

Once we have our word representation, we simply run a bi-LSTM over the sequence of word vectors and obtain another sequence of vectors.

cell_fw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_lstm)
cell_bw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_lstm)
(output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, self.word_embeddings,
    sequence_length=self.sequence_lengths_tensor, dtype=tf.float32)
# Concatenate the forward and backward outputs for each word
output = tf.concat([output_fw, output_bw], axis=-1)
output = tf.nn.dropout(output, self.dropout_tensor)

(The above code is taken from Guillaume Genthial’s GitHub repo)

At this stage, each word is associated with a vector that captures information from the meaning of the word, its characters and its context. Let’s use it to make a final prediction. We can use a fully connected neural network to get a vector where each entry corresponds to a score for each tag.

W = tf.get_variable("W", dtype=tf.float32,
    shape=[2*self.config.hidden_size_lstm, self.config.ntags])

b = tf.get_variable("b", shape=[self.config.ntags],
    dtype=tf.float32, initializer=tf.zeros_initializer())

# Flatten to 2-D for the matrix product, then restore the time dimension
nsteps = tf.shape(output)[1]
output = tf.reshape(output, [-1, 2*self.config.hidden_size_lstm])
pred = tf.matmul(output, W) + b
self.logits = tf.reshape(pred, [-1, nsteps, self.config.ntags])

(The above code is taken from Guillaume Genthial’s GitHub repo)

Finally, we use the CRF to find the tag of each word. Implementing a CRF only takes one line! The following code computes the loss and also returns the trans_params that will be useful for prediction.

log_likelihood, _trans_params = tf.contrib.crf.crf_log_likelihood(
    self.logits, self.labels_tensor, self.sequence_lengths_tensor)
self.trans_params = _trans_params
self.loss = tf.reduce_mean(-log_likelihood)

(The above code is taken from Guillaume Genthial’s GitHub repo)
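To give a feel for the quantity this loss is built on, here is a pure-Python sketch of a linear-chain CRF log-likelihood for one sentence: the score of the gold tag sequence minus the log of the summed exponentiated scores of all sequences, the latter computed with the forward algorithm. The scores, tag order and function names are assumptions for illustration; this is not the TensorFlow implementation, and it ignores batching and padding.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(unary, trans, tags):
    """Per-word scores plus transition scores of one tag sequence."""
    total = sum(unary[t][tag] for t, tag in enumerate(tags))
    total += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
    return total


def log_partition(unary, trans):
    """Forward algorithm: log of the summed exp-scores of all sequences."""
    n_tags = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            logsumexp([alpha[p] + trans[p][cur] for p in range(n_tags)])
            + unary[t][cur]
            for cur in range(n_tags)
        ]
    return logsumexp(alpha)


def log_likelihood(unary, trans, gold_tags):
    """Log-probability of the gold tag sequence under the CRF."""
    return sequence_score(unary, trans, gold_tags) - log_partition(unary, trans)


def brute_force_log_partition(unary, trans):
    """Same quantity by enumerating every sequence (tiny inputs only)."""
    n_tags = len(unary[0])
    return logsumexp([sequence_score(unary, trans, tags)
                      for tags in product(range(n_tags), repeat=len(unary))])


# Made-up scores for a two-word sentence with three tags.
unary = [[1.0, 0.2, 1.2], [0.3, 2.5, 0.4]]
trans = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, -2.0, 0.0]]
gold = [0, 1]

ll = log_likelihood(unary, trans, gold)
loss = -ll  # the quantity that training would minimise
print(ll, loss)
```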

And then, we can define our train operator as:

optimizer = tf.train.AdamOptimizer(self.lr_tensor)
self.train_op = optimizer.minimize(self.loss)

(The above code is taken from Guillaume Genthial’s GitHub repo)

Once the model is defined, we run it over the dataset for a few epochs to get the trained model.

How to use the trained model

TensorFlow provides the ability to save the model weights so that they can be restored later. Whenever we run the prediction step, the saved weights are loaded, so the model doesn’t need to be trained again.

def save_session(self):
    """Saves session = weights"""
    if not os.path.exists(self.config.dir_model):
        os.makedirs(self.config.dir_model)
    self.saver.save(self.sess, self.config.dir_model)

def restore_session(self, dir_model):
    self.saver.restore(self.sess, dir_model)

(The above code is taken from Guillaume Genthial’s GitHub repo)

Each article is fed into the model, where it is split into words that then go through the series of steps described above to produce the output. The final output of the model classifies each word into one of 4 categories: organisation, person, miscellaneous, and other. Since the model is not 100% accurate, its output goes through a rule-based step that filters the result further, so that the names and organisations most prominent in the article are extracted correctly.
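The exact rules are not listed here, so the following is only a guess at what such a post-processing step could look like: merge consecutive words with the same tag into one entity, then count how often each entity appears and pick the most frequent one per category. The tag names (PER, ORG, O), the sample sentence and both helper functions are assumptions made for this sketch.

```python
from collections import Counter


def group_entities(words, tags):
    """Merge consecutive words that share a non-O tag into one entity."""
    entities, current, current_tag = [], [], None
    for word, tag in zip(words, tags):
        if tag == current_tag and tag != "O":
            current.append(word)  # the entity continues
        else:
            if current:
                entities.append((" ".join(current), current_tag))
            # Start a new entity, or reset when the word is tagged O.
            current, current_tag = ([word], tag) if tag != "O" else ([], None)
    if current:
        entities.append((" ".join(current), current_tag))
    return entities


def most_prominent(entities, tag):
    """Most frequent entity of the given tag, or None if there is none."""
    counts = Counter(name for name, entity_tag in entities if entity_tag == tag)
    return counts.most_common(1)[0][0] if counts else None


# Made-up model output for a short article.
words = ("Tim Cook said Apple will hire more staff . "
         "Apple shares rose after Cook spoke .").split()
tags = ["PER", "PER", "O", "ORG", "O", "O", "O", "O", "O",
        "ORG", "O", "O", "O", "PER", "O", "O"]

entities = group_entities(words, tags)
print(entities)
print(most_prominent(entities, "ORG"))
```

A real rule set would likely also merge variants of the same name (for example "Tim Cook" and "Cook"), which this sketch does not attempt.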

This article is based on Guillaume Genthial’s code base for entity extraction; please visit his GitHub repo to see the code in action. I made a few slight changes for the purposes of my project, which can be found here.

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.