Teaching an AI to summarise news articles: A new dataset for abstractive summarisation

We are open-sourcing 40,000 professionally-written summaries of news articles. Instructions for accessing the dataset can be found in our GitHub repository, along with examples of how we use the dataset for fine-tuning.

Background

Automatic text summarisation is one of those research topics that has been around since the early days of computational linguistics and is still receiving a lot of interest from the research community, as it is far from being considered a solved problem.

― Constantin Orasan, Automatic summarisation: 25 years On (doi:10.1017/S1351324919000524)

Getting a computer to summarise a long document is a problem that dates back to the earliest days of Natural Language Processing (NLP). Statistical attempts in the late 1950s were followed by the empiricist approaches of the 70s and machine learning techniques in the 90s, leading finally to the increasingly popular deep learning methods in use today.

Broadly speaking, there are two computational approaches to the problem: extractive and abstractive. The former takes one or more documents and, using an algorithm, extracts what it deems the most relevant sections to produce a summary, without modifying the wording of the original.

In contrast, abstractive summarisation takes the input document and tries to write a coherent, plausible and factually correct summary.
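To make the extractive approach concrete, here is a minimal sketch in Python. It is not one of the models discussed in this post: it simply scores each sentence by the average frequency of its words in the document and returns the top-scoring sentences verbatim, in their original order. The function names, the naive sentence splitter and the "ignore words of three letters or fewer" filter are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, unmodified and in document order."""
    sentences = split_sentences(text)
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stand-in for a stop-word list: ignore very short words.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if not tokens:
            return 0.0
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original ordering
    return " ".join(sentences[i] for i in chosen)
```

Because the output is stitched together from sentences copied out of the source, the wording is never modified. An abstractive system has no such guarantee, which is part of what makes it harder to build.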

Abstractive summarisation is one of the hardest problems in natural language processing. An ideal summariser would not only be able to generate coherent text, it would also understand which information can be lifted directly from the source text, and which parts can be paraphrased. State-of-the-art (SOTA) models therefore don’t rely exclusively on translating between the input document and summary, but incorporate copy mechanisms and extractive summarisation tasks in training.

Corpora

There are practical difficulties with building abstractive summarisation models as well. One is the sheer volume of information that has to be processed. Many NLP tasks work on the sentence or paragraph level. But for summarisation, an entire document is de rigueur. Such large input sizes force us to limit our batch sizes and can make training tedious and expensive.

There are relatively few datasets for abstractive summarisation, let alone good ones, and fewer still are publicly available. When training models for translation between languages we can draw on sources like books or the proceedings of the European Parliament. For summarisation, though, there has not been much impetus for people to make significant numbers of high-quality abstractive summaries available for public consumption, particularly given the expense that goes into producing them.

Historically, datasets have been generated using clever hacks. For instance, we might use the sections of a Wikipedia article below the table of contents as our input, and the section above as our target output. Even the most popular dataset for abstractive summarisation, the CNN/Daily Mail dataset, is only able to use the subtitles of its articles for a target output. Further still, the text found in these implied summaries is often noisy due to scraping inaccuracies: around 0.5% of the CNN/Daily Mail dataset has been found to have such errors, as shown below.
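The Wikipedia trick described above can be sketched in a few lines of Python. This is an illustration only, assuming the article is available as raw wikitext in which section headings look like `== History ==`. The function name `make_pair` is hypothetical, and a real pipeline would also need to strip templates, references and other markup.

```python
import re

# Assumption: section headings are lines of the form "== Title ==" (wikitext style).
HEADING = re.compile(r"^==+.*==+[ \t]*$", re.MULTILINE)


def make_pair(wikitext):
    """Split an article into (source, target) for summarisation training.

    target = the lead section above the first heading (the implied summary)
    source = everything from the first heading onwards (the input document)
    """
    match = HEADING.search(wikitext)
    if match is None:
        # No headings at all: the whole text is lead, so there is no body to summarise.
        return "", wikitext.strip()
    target = wikitext[:match.start()].strip()
    source = wikitext[match.start():].strip()
    return source, target
```

Pairs built this way are cheap to produce at scale, but the lead section was never written as a summary of the sections below it, which is why such targets are only implied summaries.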

Curation Dataset

For the last five years at Curation, we have been working to provide companies with the information that they most need to see. We keep abreast of the latest news developments across a range of industries and as part of this service we have a team of professional abstractors writing summaries of news articles.

We are open-sourcing 40,000 human-written abstracts of news articles, together with links to the sources, under a Creative Commons license for NLP researchers and data scientists to explore and build on.

We believe (and our clients seem to agree) these summaries are of excellent quality, and we’re excited to see how the NLP community can utilise them. Although we cannot include the original article body, we are also releasing a script alongside the dataset that can download and parse the sources for personal use.

What do we mean by excellent quality? As the CNN/Daily Mail corpus’ targets are the subtitles of various news articles, they often assume that the reader has already seen the piece’s headline. This means that taken in isolation the subtitles can feel like they are missing their first sentence. In our experiments we have found that state-of-the-art approaches replicate this in their predictions.

In contrast, Curation’s abstracts are specifically written by professional copywriters to stand alone as intelligible pieces of content in their own right. Our writing team conform to a comprehensive internal style-guide and each piece is proofed twice by our editorial and content curation teams to ensure factual and stylistic accuracy. Our abstracts are on average 40 words longer than the summaries in other publicly available datasets and are designed to maximise information density whilst still being a joy to read.
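If you want to check a length comparison like this on your own copies of the datasets, a whitespace word count is a reasonable first approximation. The sketch below assumes each dataset has already been loaded as a list of summary strings. The function name is hypothetical and the tokenisation is deliberately simple, so the exact figures will depend on how you count words.

```python
def mean_word_count(summaries):
    """Average number of whitespace-separated words per summary."""
    if not summaries:
        return 0.0
    return sum(len(s.split()) for s in summaries) / len(summaries)


def length_gap(summaries_a, summaries_b):
    # Positive when summaries_a are longer on average than summaries_b.
    return mean_word_count(summaries_a) - mean_word_count(summaries_b)
```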

Want to try it out and see for yourself? Click here to download the Curation Abstracts Dataset.