Is Segmentation a Solved Problem?

An Exploration into the Methods and Difficulties of Text Segmentation

Many in the translation industry seem to think that sentence segmentation is a solved problem; I tend to disagree. While there are tools that achieve high accuracy for specific languages and specific domains, there still does not exist a free, open-source tool that can handle many languages and deal with ill-formatted content across any incoming domain.

On TM-Town, translators upload documents across many different fields of expertise and in many languages. As such, it is important that TM-Town's segmentation engine be able to handle many different languages as well as deal with potentially ill-formatted content (e.g. text imported from a PDF often has line breaks that fall in the middle of sentences). To deal with these issues I have worked on developing a new sentence segmentation engine.

As TM-Town benefits from a lot of open-source technology, I have decided to open source TM-Town's sentence segmentation library in the hope that it might benefit the community. TM-Town's segmentation tool is called Pragmatic Segmenter; it is a rule-based sentence boundary detection library written in Ruby.

The goal of Pragmatic Segmenter is to provide a "real-world" segmenter that works out of the box across many languages and does a reasonable job when the format and domain of the input text is unknown. Pragmatic Segmenter does not use any machine-learning techniques and thus does not require training data.

Pragmatic Segmenter aims to improve on other segmentation engines in two main areas:

1. Language support
2. Text cleaning and preprocessing

What is sentence segmentation?

According to Wikipedia, sentence boundary disambiguation (also known as sentence segmentation) is defined as:

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.
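The ambiguity described above is easy to demonstrate. Here is a quick illustration (plain Ruby, not Pragmatic Segmenter's implementation) of why simply splitting on terminal punctuation fails:

```ruby
# A naive splitter that treats every period/question mark/exclamation
# point followed by whitespace as a sentence boundary.
def naive_split(text)
  text.split(/(?<=[.?!])\s+/)
end

text = "Dr. Smith arrived at 4.30 p.m. yesterday. He was late."
naive_split(text)
# mis-splits after "Dr." and "p.m.", producing 4 pieces instead of 2:
# => ["Dr.", "Smith arrived at 4.30 p.m.", "yesterday.", "He was late."]
```

The period in "4.30" is handled only by accident (no trailing whitespace); the abbreviations are not handled at all.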

Why is sentence segmentation important for translators?

Translation memory data is typically stored and exchanged at the segment level. Poor segmentation will at best give the translator extra work within their CAT tool of choice to fix the segmentation issues manually and at worst will reduce the usefulness of a translator’s translation memory.

Why is sentence segmentation important in the Natural Language Processing (NLP) community?

Although segmentation is not a "sexy" problem, it is very important and the base of many other NLP functions and tasks (e.g. machine translation, bitext alignment, named entity extraction, part-of-speech tagging, summarization, classification, etc.). As segmentation is often the first step needed to perform these NLP tasks, poor accuracy in segmentation can lead to poor end results.

Sentence segmentation methods

Typically sentence segmentation tools use one of the following methods:

Machine learning (unsupervised and supervised)

Rule-based

Tokenize-first group-later (e.g. Stanford CoreNLP)

TM-Town's sentence segmenter uses a rule-based method designed to work out of the box across many languages. Users of the library do not need to create any rules manually; they are all built in. Additionally, TM-Town's library does not require any training data, unlike libraries that rely on machine learning methods.
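To give a flavor of the rule-based approach, here is a deliberately tiny sketch of one common technique: mask the periods in known abbreviations with a placeholder, split on terminal punctuation, then unmask. This is illustrative only and is not Pragmatic Segmenter's actual rule set:

```ruby
# Minimal mask-split-unmask sketch of rule-based segmentation.
ABBREVIATIONS = %w[Mr. Mrs. Dr. Prof. vs.].freeze
PLACEHOLDER = "\u2205" # assumed not to occur in the input

def segment(text)
  masked = text.dup
  # Replace the period in each known abbreviation so it survives the split.
  ABBREVIATIONS.each { |a| masked.gsub!(a, a.tr('.', PLACEHOLDER)) }
  masked.split(/(?<=[.?!])\s+/).map { |s| s.tr(PLACEHOLDER, '.') }
end

segment("Mr. Jones met Dr. Lee. They spoke briefly.")
# => ["Mr. Jones met Dr. Lee.", "They spoke briefly."]
```

A production-grade rule-based engine layers hundreds of such rules (numbers, ellipses, quotations, lists, URLs, and so on) on top of this basic idea.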

Language support

Typically segmentation tools focus almost exclusively on English. Sometimes they may also include language packs for a few other languages (e.g. German). As I mentioned above, TM-Town has translators from all over the globe who work in many different languages. While many languages use punctuation similar to English, some use completely different punctuation. If a segmentation engine is not built for these cases, it will fail spectacularly when given a language that does not use typical punctuation. For example, here are some languages that do not use the same sentence-ending punctuation (. ? !) as English:

Amharic full stop (።) question mark (፧)

Arabic question mark (؟)

Armenian full stop (։)

Burmese full stop (။)

Chinese full stop (。) question mark (？) exclamation point (！)

Greek question mark (;)

Hindi full stop (।)

Japanese full stop (。) question mark (？) exclamation point (！)

Persian question mark (؟)

Urdu question mark (؟)
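One way to handle this variety is to key the boundary rule on a per-language terminator set. The sketch below (illustrative only; the language codes and table are my own, not Pragmatic Segmenter's internals) builds a language-aware boundary regex from characters like those in the list above:

```ruby
# Per-language sentence terminators (a small sample, not exhaustive).
TERMINATORS = {
  'en' => %w[. ? !],
  'zh' => %w[。 ？ ！],   # Chinese
  'hi' => %w[।],          # Hindi danda
  'el' => %w[. ;],        # Greek question mark is the semicolon
  'ar' => %w[. ؟]         # Arabic question mark
}.freeze

def boundary_regex(lang)
  chars = TERMINATORS.fetch(lang, TERMINATORS['en'])
  # Split immediately after any terminator, consuming optional whitespace
  # (many CJK texts put no space between sentences).
  /(?<=[#{Regexp.escape(chars.join)}])\s*/
end

"你好。你是谁？".split(boundary_regex('zh'))
# => ["你好。", "你是谁？"]
```

An English-only splitter would return this Chinese text as a single "sentence", since it contains none of `. ? !`.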

The Golden Rules

The Golden Rules are a set of tests I developed that can be run through a segmenter to check its accuracy on certain edge cases. Most academic papers studying segmentation have used either the WSJ corpus or the Brown corpus from the Penn Treebank to test their segmentation algorithms. In my opinion there are two limitations to using these corpora:

1. The corpora may be too expensive for some people ($1,700)
2. The majority of the sentences in the corpora end with a regular word followed by a period, thus testing the same thing over and over again

In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and only 83% [53% according to Gale and Church, 1991] of sentences end with a regular word followed by a period. (Andrei Mikheev, "Periods, Capitalized Words, etc.")

Therefore, I created a set of distinct edge cases on which to compare segmentation tools. As most segmentation tools achieve very high accuracy on ordinary text, what is really important to test, in my opinion, is how a segmenter handles the edge cases - not whether it can segment 20,000 sentences that each end with a regular word followed by a period. I have named these example tests the "Golden Rules". The list is by no means complete and will evolve and expand over time. To view the Golden Rules, visit TM-Town's Natural Language Processing resource page.
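A Golden Rules-style check is simple to run: each rule pairs an input with its expected segmentation, and a segmenter's score is the fraction of rules it passes. Here is a sketch using two of the simpler English rules; `segmenter` is any callable returning an array of sentences:

```ruby
# Each rule is [input, expected segmentation].
GOLDEN_RULES = [
  ["Hello World. My name is Jonas.",
   ["Hello World.", "My name is Jonas."]],
  ["What is your name? My name is Jonas.",
   ["What is your name?", "My name is Jonas."]]
].freeze

# Fraction of rules where the segmenter's output matches exactly.
def score(segmenter, rules)
  passed = rules.count { |text, expected| segmenter.call(text) == expected }
  passed.to_f / rules.size
end

simple = ->(text) { text.split(/(?<=[.?!])\s+/) }
score(simple, GOLDEN_RULES)
# => 1.0 (even a naive splitter passes these two easy rules)
```

The interesting differences between tools only show up on the harder rules: abbreviations, decimals, ellipses, quotations, and non-English punctuation.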

Results of the Golden Rule tests

The Holy Grail of sentence segmentation appears to be Golden Rule #18, as no segmenter I tested was able to correctly segment that text. The difficulty is that an abbreviation (in this case a.m./A.M./p.m./P.M.) followed by a capitalized abbreviation (such as Mr. or Mrs.) or by a proper noun such as a name can be either a sentence boundary or a non-boundary.
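To make the ambiguity concrete, here is an example shaped like that rule (my own paraphrase, not quoted from the official Golden Rules list):

```ruby
text = "At 5 a.m. Mr. Smith went to the bank. " \
       "He left the bank at 6 P.M. Mr. Smith then went to the store."

expected = [
  "At 5 a.m. Mr. Smith went to the bank.",
  "He left the bank at 6 P.M.",
  "Mr. Smith then went to the store."
]

# The same surface pattern occurs twice: "a.m. Mr." is NOT a boundary,
# while "P.M. Mr." IS one. A punctuation-only splitter cannot tell them
# apart and over-segments:
naive = text.split(/(?<=[.?!])\s+/)
naive.length
# => 6 (splits after "a.m.", "Mr.", "bank.", "P.M.", and "Mr.")
```

Resolving this correctly requires context beyond the punctuation itself, which is why the rule has proven so hard.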

The results of how various popular segmentation tools fared against the Golden Rules test set can be found in the table below:

† GRS (Other Languages) is the total of the Golden Rules listed above for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed above.

‡ Speed is based on the performance benchmark results detailed in the section "Speed Performance Benchmarks" below. The number is an average of 10 runs.

Other segmentation tools not yet tested:

In Conclusion

In my opinion sentence segmentation is not yet a solved problem, but hopefully TM-Town can help to continue pushing the community forward by working to improve segmentation accuracy across many languages. It is certain that more work still needs to be done to further improve accuracy rates across all languages and to add support for even more languages.

Live Demo

Give TM-Town's segmentation engine a try by adding some text below that you would like segmented. If you find any text that does not segment the way you would expect, please let me know by opening an issue. If you are a translator, be sure to try out TM-Town - it's free. You'll be able to see the results of the segmentation engine at work when you upload a document into TM-Town's system.
