One of Agolo’s main areas of focus is text summarization. Summarization is the process of taking a mess of unstructured information and distilling it into something consumable. Easier said than done. This formulation of the problem oversimplifies many considerations that a summarization system has to take into account. Implementing a system comes with choices, and with choices come tradeoffs.

This post lays out some of the most important choices. It also gives an overview of summarization in general. Each of the following sections discusses some of the insights we have gained and the approaches we take to tackle these problems at Agolo.

Abstractive vs. Extractive

When a human is given a corpus of text to summarize, they might rewrite the main points in their own words. This is how many professional human-written summaries, such as Cliffs Notes, are produced. It’s called abstractive summarization.

Rewriting in different words requires many high-level skills that only human experts have. Simulating these skills requires coherent natural language generation, an extensive and deep knowledge of context, and adequate modeling of the reader’s mind. The state of the art is not yet up to par, so many automatic summarization systems opt for a technique called extractive summarization.

Extractive summaries are excerpts taken directly from the input documents and presented in a readable way. The summary does not contain any rephrasing of the ideas presented in the original text.

Some challenges in picking these sentences include:

processing the input text to make it presentable in the summary

determining which sentences are the most salient

making the summary cohesive and readable

minimizing the number of references to ideas and entities not mentioned in the summary
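To make the salience challenge concrete, here is a minimal sketch of an extractive summarizer in Python. It assumes a simple word-frequency heuristic for salience. This is a textbook baseline chosen for illustration, not a description of Agolo's system. The names split_sentences, tokenize, extractive_summary, and the small STOPWORDS set are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def extractive_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    # Word frequencies over the whole input stand in for "importance".
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Present the chosen sentences in their original order for readability.
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Every sentence in the output is copied verbatim from the input, which is what makes the summary extractive. The sketch addresses only the salience challenge; it does nothing about cohesion or dangling references.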

Agolo falls somewhere between an extractive and abstractive summarization service. Our system starts by picking the most salient sentences as a basis for a summary, and then takes into account an extensive knowledge graph and the discourse to provide context.

Single-document vs. Multi-document

Does the summarizer produce one summary per document, or does it distill multiple documents into a single summary?

When summarizing a single document, the summarization system can rely on a cohesive piece of text with very little repetition of facts. The author of a single document rarely states the same information more than once.

However, if the system summarizes multiple related documents, then it must ensure that the summary doesn’t contain repeated or conflicting information. The research literature sometimes formulates this as an optimization problem. An ideal multi-document summarizer maximizes the important information included in the summary while minimizing repetition.
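One common way to approximate this optimization is a greedy selection in the style of maximal marginal relevance (MMR). The sketch below is an illustration under stated assumptions, not Agolo's method: importance scores are supplied by the caller, redundancy is measured with Jaccard word overlap, and the names select_sentences and trade_off are hypothetical.

```python
import re

def words(sentence):
    # Set of lowercase word tokens in the sentence.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def jaccard(a, b):
    # Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical).
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_sentences(scored_sentences, k=2, trade_off=0.7):
    """Greedily pick k sentences from (sentence, importance) pairs.

    trade_off weights importance against redundancy:
    1.0 ignores redundancy entirely, lower values penalize it more.
    """
    candidates = list(scored_sentences)
    selected = []
    while candidates and len(selected) < k:
        def mmr(item):
            sentence, importance = item
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (jaccard(words(sentence), words(s)) for s, _ in selected),
                default=0.0,
            )
            return trade_off * importance - (1 - trade_off) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [s for s, _ in selected]
```

With the default trade_off, a near-duplicate of an already selected sentence loses out to a less important sentence that adds new information.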

How do you pick which documents to summarize together in the first place? In a large dataset, how do you know which documents are related? Agolo solves this by clustering documents into logical stories before summarizing them. The quality of our clusters ensures that the documents in each one are about the same topic or event.
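As a rough illustration of grouping related documents, here is a single-pass clustering sketch. It assumes bag-of-words vectors, cosine similarity, and a fixed threshold, all of which are choices made for this example. Agolo's actual clustering approach is not described in this post, and the names vectorize, cosine, and cluster_documents are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    # Bag-of-words counts. Assumption: no stopword removal or weighting,
    # so common words would inflate similarity on real text.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster_documents(docs, threshold=0.3):
    """Assign each document to the most similar existing cluster,
    or start a new cluster if none reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # summed counts act as the centroid
            best["members"].append(i)
    return [cluster["members"] for cluster in clusters]
```

Each returned list holds the indices of documents that would be summarized together. The result of a single-pass scheme like this depends on document order, which is one reason production systems tend to use more robust methods.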

Indicative vs. Informative

The purpose of the summary is closely tied to its intended audience and their goals.

If the reader’s goal is to gain a cursory understanding of a new or large topic, then the summary needs to be indicative. It should give an overview of the content rather than dissect every aspect. This use case is for people like analysts or journalists who need to keep up with fast-moving news or explore unknown topics. They can choose to dive deeper into the text if they feel the need. But most of the time, the main points of a text are good enough for them.

Decision-makers, on the other hand, need detailed breakdowns of text. In this case, the summary should be informative. It should analyze in detail every topic covered in the text. The summary should almost be a replacement for the original text.

Depending on the use case, input documents, and other factors, Agolo’s summarization system provides both indicative and informative summaries. We provide personalized summaries tailored to each client’s needs. Personalization at this level is based on a close relationship with our users to understand their goals and concerns.

Generic vs. Query-based vs. Domain-specific

How users consume the summaries greatly influences summarization.

A summarization system with what’s called a generic trigger will find the most important topics in a given input text and summarize it without further guidance. For example, a system could produce a summary of the most important, real-time information about a hurricane as news articles are being published.

A generic trigger for summarization is useful in cases where the user does not yet know the contents of the text to be summarized. This is a challenging use case because it cannot rely on human intervention. The summary needs to present the most important topics, which might serendipitously provide the user with new knowledge. Identifying topics and determining which ones the author considers important are difficult challenges.

In contrast, a query-based summary starts with a topic or question. A query-based summarization system may be given a large corpus of research papers and asked to summarize the effects of a specific chemical compound on the environment. The system first needs to find only the papers that mention that chemical compound, identify the sections that mention its effects, and then summarize the salient points. The system needs to be able to discern the thread of each topic weaving through the text, and then provide a concise version of that discussion.
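The three steps in that example (filter the documents, find the relevant passages, condense them) can be sketched as follows. This is a deliberately simple illustration that assumes exact keyword matching and treats single sentences as the "sections". A real system would need to handle synonyms, coreference, and topic structure. The names split_sentences and query_summary are hypothetical.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def query_summary(documents, query, max_sentences=3):
    terms = tokens(query)
    hits = []
    for doc in documents:
        # Step 1: keep only documents that mention every query term.
        if not terms <= tokens(doc):
            continue
        # Step 2: keep the sentences that mention at least one query term.
        for sentence in split_sentences(doc):
            overlap = len(terms & tokens(sentence))
            if overlap:
                hits.append((overlap, sentence))
    # Step 3: rank by overlap with the query and truncate.
    # The sort is stable, so ties keep their original order.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [sentence for _, sentence in hits[:max_sentences]]
```

For instance, given a few short paper abstracts and the query "atrazine", the function returns only the sentences that mention atrazine, drawn only from the papers that discuss it.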

A summarizer could also use the text’s domain. For instance, a blog post about a recipe will use different jargon than a blog post about the acquisition of a tech company. A summary meant for experts in a domain can leverage taxonomies or a knowledge graph to take advantage of jargon.

For example, a financial analyst would not need the summary to define the word “acquisition.” In finance, the word means something different from what it means in an article about a museum acquiring paintings. Taking domain into account, the summarizer can expect sentences about mergers to be related to sentences about acquisitions.

Agolo’s summarization system is designed to handle text in a variety of domains depending on the client and features derived from the input text.

Genre and Other Factors

A summarization system works best when it knows in advance what kind of text it will encounter. Let’s call the characteristics that define a kind of text, such as its length, form, and genre, its formal properties.

The length of the input text heavily impacts the sort of approaches a summarization system can take. A typical news article can be summarized with conventional extractive summarization techniques. On the other hand, a 20-page report or a chapter of a book can only be summarized with the help of more advanced approaches like hierarchical clustering or discourse analysis.

Another consideration is the original form of the text: spoken vs. written. Spoken language is different from written language. Transcripts of spoken language are more likely to contain ungrammatical utterances with lots of repetitions.

A related consideration is the genre of the input documents. The discourse structure of an earnings call transcript is different from that of a 10-K filing. Dialog analysis is relevant to genres like chat logs, emails, and customer service phone call transcripts, which record conversations among several people. Dialog contains more topic shifts, interruptions, and anaphora than other genres.

A good summarization system can leverage genre information and other formal properties to produce useful summaries. Agolo’s summarizer takes these factors into account at various points in the summarization process.

In Summary

Summarization is more complicated than it seems.

When a machine generates a summary, it needs to take into consideration:

the level of the output summary’s abstractiveness

the number of input documents per summary

the purpose of the summary

the type of trigger for the summary

genre and other formal properties of the input text

With Agolo, we’ve shown that it’s possible to fall somewhere between these hard choices to best satisfy users’ needs. Our clients have a wide range of use cases that span all of these considerations and more. Our summarization system is designed to be flexible and extensible to suit these needs. Building in that flexibility is one of the most challenging aspects of summarization at Agolo.

This post is only a high-level overview of these considerations. For further reading and a more detailed typology, I recommend the following resources: