The Neural Network

Various models have been proposed for extractive text summarization. Most treat it as a classification problem: for each sentence, decide whether or not it should be included in the summary. They typically compare each sentence with every other one to find the most frequently used words, then score each sentence on that basis.

A threshold score is chosen based on the required summary length, and every sentence that scores above the threshold is included in the summary. This is generally done with a standard Naïve Bayes classifier or a Support Vector Machine.
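The scoring-and-threshold idea can be sketched in a few lines. This is only an illustration, not one of the classifiers mentioned above: it assumes a plain word-frequency score averaged over each sentence, a naive regex sentence splitter, and hypothetical function names.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words only; punctuation is dropped.
    return re.findall(r"[a-z']+", sentence.lower())

def score_sentences(sentences):
    # Count how often each word occurs across the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        # Average frequency, so long sentences are not favored automatically.
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores

def summarize_by_threshold(text, threshold):
    # Keep, in document order, every sentence scoring above the threshold.
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    return [s for s, score in zip(sentences, scores) if score > threshold]
```

Raising the threshold shortens the summary and lowering it lengthens the summary, which is how the required summary length would be controlled in this sketch.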

For genre-specific summarization (medical reports or news articles), engineering-based models or models that are trained using articles of the same genre have been more successful, but these techniques give poor results when used for general text summarization.

Flow Diagram depicting traditional models

What if we could use a fully data-driven approach to train a feedforward neural network that gives reliable results irrespective of the genre of the document? A simple model consisting of one input layer, one hidden layer, and one output layer can be used for this task. This model would be able to summarize documents of arbitrary size by breaking them into fixed-size parts and feeding those parts to the network one after another.

If you look back at the paragraph we summarized earlier, you’ll see that the summary uses the exact same sentences that appear in the original document:

However, we cannot afford to ignore the possibilities that AI offers us for dramatically improving the student learning experience. The resistance faced by this new technology has to decrease because AI can not only help the teachers be more productive, but also make them more responsive towards the needs of the students.

These two sentences were selected by the model as the most relevant for the summary. The document is fed to the input layer, the computation is done in the hidden layer, and the output layer produces a vector of probabilities that determines whether each sentence is to be included in the summary or not.
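A minimal sketch of such a three-layer network's forward pass, using NumPy, might look like the following. The layer sizes, the tanh hidden activation, the sigmoid output, and the function names are all assumptions made for illustration. The weights here are random, so the probabilities are meaningless until the network is trained.

```python
import numpy as np

def init_params(input_dim, hidden_dim, page_length, seed=0):
    # Small random weights; a trained model would learn these values.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(input_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, page_length)),
        "b2": np.zeros(page_length),
    }

def forward(params, x):
    # Input layer -> hidden layer (tanh is an assumed activation).
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    # Hidden layer -> output layer: one logit per sentence slot.
    logits = hidden @ params["W2"] + params["b2"]
    # Sigmoid maps each logit to an independent inclusion probability.
    return 1.0 / (1.0 + np.exp(-logits))

# Example: 4 sentences, each a 3-dimensional vector, flattened into one input.
params = init_params(input_dim=4 * 3, hidden_dim=8, page_length=4)
probabilities = forward(params, np.ones(4 * 3))
```

Each entry of `probabilities` corresponds to one sentence of the input, and the highest-scoring entries would be the sentences chosen for the summary.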

A fixed number of sentences are selected from every run depending on the size of the summary required. Since the input to the network must be numbers, we need a way to convert sentences to a numerical form.

A common way of doing this is to convert the words to vector representations using Word2Vec. A tutorial for the same can be found here. These word vectors can then be combined to convert our sentences into vectors of a fixed dimension.
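The text does not say how word vectors are combined into a sentence vector. One simple option, assumed here purely for illustration, is to average the vectors of the words in the sentence. The sketch below uses a hypothetical toy dictionary in place of a trained Word2Vec model.

```python
def sentence_vector(sentence, word_vectors, dim):
    # Lowercase, strip surrounding punctuation, and keep only known words.
    words = [w.strip(".,!?;:\"'") for w in sentence.lower().split()]
    known = [w for w in words if w in word_vectors]
    if not known:
        # No known words: fall back to the zero vector.
        return [0.0] * dim
    # Average each dimension over the known words.
    return [sum(word_vectors[w][i] for w in known) / len(known) for i in range(dim)]

# Toy 2-dimensional "embeddings"; a real model would supply these.
toy_vectors = {"ai": [1.0, 0.0], "helps": [0.0, 1.0]}
vector = sentence_vector("AI helps teachers.", toy_vectors, dim=2)
```

Whatever the sentence length, the result always has `dim` entries, which is what the fixed-size input layer needs.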

A simple word2vec model

The biggest problem this model faces is that, even if we convert each sentence to a fixed-dimension vector, documents still vary widely in the number of sentences they contain. The size of the input layer is fixed and cannot change with the document length. To solve this problem, we use a new approach.

We divide the document into segments, each having a fixed number of sentences — let’s call these segments “pages” and the fixed number of sentences “length”. For each run of the network, we feed it with a single “page” and get a summary of that page. After the last run we concatenate all the summaries.

For pages that have fewer sentences than the “length”, the input can be padded with zero vectors. Another advantage of this approach is that we can test the model with different “lengths” and see which gives the best results.
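The paging scheme can be sketched as follows. The helper names are hypothetical, and `score_page` is a stand-in for the trained network: any callable that takes one padded page and returns one score per sentence slot.

```python
def paginate(sentence_vectors, length, dim):
    # Split the document into pages of exactly `length` sentence vectors.
    pages = []
    for start in range(0, len(sentence_vectors), length):
        page = list(sentence_vectors[start:start + length])
        n_real = len(page)
        # Pad a short final page with zero vectors.
        page.extend([[0.0] * dim for _ in range(length - n_real)])
        pages.append((page, n_real))
    return pages

def summarize(sentences, sentence_vectors, length, dim, per_page, score_page):
    summary = []
    pages = paginate(sentence_vectors, length, dim)
    for page_idx, (page, n_real) in enumerate(pages):
        # Scores for padded slots are discarded.
        probs = score_page(page)[:n_real]
        # Pick the `per_page` highest-scoring sentences on this page.
        top = sorted(range(n_real), key=lambda i: probs[i], reverse=True)[:per_page]
        offset = page_idx * length
        # Emit the picks in document order, then move on to the next page.
        summary.extend(sentences[offset + i] for i in sorted(top))
    return summary
```

Because `length` is an ordinary parameter, trying different page sizes only requires retraining the network with a matching input layer and calling `summarize` with the new value.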

This proposed model could prove useful for generating extractive summaries, and it could run on devices with limited computational power, such as mobile phones.

Flow Diagram of the model

Training and Evaluation

For training this model, DUC (Document Understanding Conferences) datasets can be used. These datasets include document-summary pairs, with each document having two summaries (both extractive). The only pre-processing required would be to convert them to text documents, as they are provided as XML pages.

A standard way to evaluate this model is ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation. To evaluate the neural network, ROUGE compares the summaries it generates with human-written reference summaries, measuring how much of the reference text (for example, its words or n-grams) also appears in the generated summary. This direct comparison against human references is why it’s used extensively for evaluating automatic summaries and sometimes also for machine translation.
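To make the idea concrete, here is a minimal sketch of ROUGE-N recall: the fraction of the reference summary's n-grams that also appear in the generated summary. It assumes whitespace tokenization and a single reference. Established ROUGE toolkits add stemming, multiple references, and other variants, so treat this as an illustration rather than a replacement for them.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    def ngrams(text):
        # Count each n-gram in the lowercased, whitespace-split text.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram counts at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

# 3 of the 6 reference unigrams are matched, so recall is 0.5.
score = rouge_n_recall("the cat sat", "the cat sat on the mat")
```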