Seventy-one percent of today’s internet users don’t speak English as a first language, and that number keeps growing. But few people specialize in internationalization. As a result, most sites get it wrong—because things that seem straightforward are often anything but.


Take pluralization. Turning singular words into plurals within strings gets tricky quickly—even in English, where most plural words end with an s. For instance, I worked on a photo-sharing app that supported two languages, English and Chinese. It was easy to add an s to display “X like[s]” or “Y comment[s].” But what if we needed to pluralize “foot” or “inch” or “quiz”? Our simple solution became a broken hack.

And English is a relatively simple case. Many languages have more than two plural forms: Arabic, for example, has six, and many Slavic languages have more than three. In fact, at least 39 languages have more than two plural forms. Some languages only have one form, such as Chinese and Japanese, meaning that plural and singular nouns are the same.

How can we make sense of these complex pluralization issues—and solve them in our projects? In this article, I’ll show you some of the most common pluralization problems, and explain how to overcome them.

Problems with pluralization

Pluralization gets even more complex: each language also has its own rules for defining each plural form. A plural rule defines a plural form using a formula that includes a counter. A counter is the number of items you’re trying to pluralize. Say we’re working with “2 rabbits.” The number before the word “rabbits” is the counter. In this case, it has the value 2. Now, if we take the English language as an example, it has two plural forms: singular and plural. Therefore, our rules look like this:

If the counter has the integer value of 1, use the singular: “rabbit.”

If the counter has a value that is not equal to 1, use the plural: “rabbits.”
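As a minimal sketch, the two English rules above can be written as a function of the counter. The helper name `englishRabbit` is hypothetical and only covers this one word:

```javascript
// Hypothetical helper: picks the English form of "rabbit" for a given counter.
function englishRabbit(counter) {
  // Only the integer 1 is singular; 0, 2, 1.5, and every other value take the plural.
  return counter === 1 ? 'rabbit' : 'rabbits';
}

console.log(`1 ${englishRabbit(1)}`); // prints "1 rabbit"
console.log(`2 ${englishRabbit(2)}`); // prints "2 rabbits"
console.log(`0 ${englishRabbit(0)}`); // prints "0 rabbits"
```

The rule itself is a one-liner, but it has to be repeated (or generalized) for every noun, and it only holds for English.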

However, the same isn’t true in Polish, where the same word—“rabbit,” or “królik”—can take more than two forms:

If the counter has the integer value of 1, use “królik.”

If the counter has a value that ends in 2–4, excluding 12–14, use “króliki.”

If the counter is not 1 and has a value that ends in either 0 or 1, or the counter ends in 5–9, or the counter ends in 12–14, use “królików.”

If the counter has any other value than the above (a fraction such as 1.5, for example), use “królika.”

So much for “singular” and “plural.” For languages with three or more plural forms, we need more specific labels.

Different languages use different types of numbers

You may also want to display the counter along with the pluralized noun, such as, “You have 3 rabbits.” However, not all languages use the Western Arabic digits (0–9) you may be accustomed to. Arabic, for example, uses Arabic-Indic digits, ٠١٢٣٤٥٦٧٨٩:

0 books: ٠ كتاب

1 book: كتاب

3 books: ٣ كتب

11 books: ١١ كتابًا

100 books: ١٠٠ كتاب

Different languages or regions use different number formats

We also often aim to make large numbers more readable by adding separators, as when we render the number 1000 as “1,000” in English. But many languages and regions use different fractional and thousand separators. For example, German renders the number 1000 as “1.000.” Other languages don’t group numbers by thousands, but rather by tens of thousands.
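To see these differences in practice, here is a small sketch using JavaScript's built-in `Intl.NumberFormat`. This API is not mentioned in the article; it is used here only as an illustration, and it assumes a runtime that ships full locale data (modern browsers and recent Node.js releases do). The locale tag `ar-u-nu-arab` explicitly requests Arabic-Indic digits.

```javascript
// One formatter per locale; each knows that locale's separators and digits.
const english = new Intl.NumberFormat('en-US');
const german = new Intl.NumberFormat('de-DE');
const arabic = new Intl.NumberFormat('ar-u-nu-arab');

console.log(english.format(1000));   // prints "1,000"
console.log(german.format(1000));    // prints "1.000"
console.log(english.format(1234.5)); // prints "1,234.5"
console.log(german.format(1234.5));  // prints "1.234,5"
console.log(arabic.format(3));       // prints "٣"
```

The same number produces a different string per locale, which is why separators and digits should never be hard-coded.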

Solution: ICU’s MessageFormat

Pluralization is a complex problem to solve—at least, if you want to handle all these edge cases. Recently, International Components for Unicode (ICU) did precisely that with MessageFormat. ICU’s MessageFormat is a markup language specifically tailored to localization. It allows you to define, in a declarative way, how nouns should be rendered in various plural forms. It sorts all the plural forms and rules for you, and formats numbers correctly. Unfortunately, many of you probably haven’t heard of MessageFormat yet, because it’s mostly used by people who work specifically with internationalization—known to insiders as i18n—and JavaScript has only recently evolved to handle it.

Let’s talk about how it works.

Using CLDR for plural forms

CLDR stands for Common Locale Data Repository, and it’s a repo that companies like Google, IBM, and Apple draw on to get information about number, date, and time formatting. CLDR also contains data on the plural forms and rules for many languages. It’s probably the largest locale data repository in the world, which makes it ideal as the basis for any internationalization JavaScript tool.

CLDR defines up to six different plural forms. Each form is assigned a name: zero, one, two, few, many, or other. Not all locales need every form; remember, English only has two: one and other. The name of each form is based on its corresponding plural rule. Here is a CLDR example for the Polish language—a slightly altered version of our earlier counter rules:

If the counter has the integer value of 1, use the plural form one.

If the counter has a value that ends in 2–4, excluding 12–14, use the plural form few.

If the counter is not 1 and has a value that ends in either 0 or 1, or the counter ends in 5–9, or the counter ends in 12–14, use the plural form many.

If the counter has any other value than the above, use the plural form other.
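One way to check these rules without implementing them by hand is JavaScript's built-in `Intl.PluralRules`, which returns the CLDR plural form name for a counter. It is not one of the tools discussed in this article, and the sketch below assumes a runtime with Polish locale data available:

```javascript
// Ask the runtime which CLDR plural form applies to each counter in Polish.
const polishRules = new Intl.PluralRules('pl');

for (const counter of [1, 2, 5, 12, 21, 22, 1.5]) {
  console.log(counter, polishRules.select(counter));
}
// prints:
// 1 one
// 2 few
// 5 many
// 12 many
// 21 many
// 22 few
// 1.5 other
```

Note that 21 falls under many (it ends in 1 but is not 1), while 22 falls under few, exactly as the rules above describe.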

Instead of manually implementing CLDR plural forms, you can make use of tools and libraries. For example, I created L10ns, which compiles the code for you; Yahoo’s FormatJS has all the plural forms built in. The big benefits of these tools and libraries are that they scale well, as they abstract the plural-form handling. If you choose to hard-code these plural forms yourself, you will end up exhausting yourself and your teammates, because you’ll need to keep track of all the forms and rules, and define them over and over whenever and wherever you want to format a plural string.

MessageFormat is a domain-specific language that uses CLDR, and is specifically tailored for localizing strings. You define markup inline. For example, we want to format the message “I have X rabbit[s]” using the right plural word for “rabbit”:

var message = 'I have {rabbits, plural, one{# rabbit} other{# rabbits}}';

As you can see, a plural format is defined inside curly brackets {}. It takes a counter, rabbits, as the first argument. The second argument defines the type of formatting, in this case plural. The third argument lists CLDR’s plural forms (here, one and other), and each form is followed by a sub-message inside its own curly brackets. Within a sub-message, the symbol # renders the counter with the correct number format and numbering system, which solves the problems we identified earlier with the Arabic-Indic numbering system and with number formatting.

Here we parse the message in the en-US locale and output different messages depending on which plural form the variable rabbits takes:

var message = 'I have {rabbits, plural, one{# rabbit} other{# rabbits}}.';
var messageFormat = new MessageFormat('en-US');
var output = messageFormat.parse(message);

// Will output "I have 1 rabbit."
console.log(output({ rabbits: 1 }));

// Will output "I have 10 rabbits."
console.log(output({ rabbits: 10 }));
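To make the mechanics less opaque, here is a rough sketch of what a formatter has to do with such a message. It is not the implementation of any MessageFormat library: `formatPlural` is a hypothetical helper that handles only un-nested plural blocks, and it leans on the built-in `Intl.PluralRules` and `Intl.NumberFormat` rather than on bundled CLDR data.

```javascript
// Hypothetical mini-formatter: supports only un-nested {name, plural, ...} blocks.
function formatPlural(message, params, locale) {
  const pluralRules = new Intl.PluralRules(locale);
  const numberFormat = new Intl.NumberFormat(locale);
  // Matches e.g. "{rabbits, plural, one{# rabbit} other{# rabbits}}".
  const block = /\{(\w+),\s*plural,\s*((?:\w+\s*\{[^{}]*\}\s*)+)\}/g;

  return message.replace(block, (match, name, forms) => {
    const counter = params[name];

    // Collect the sub-message for each plural form,
    // e.g. { one: '# rabbit', other: '# rabbits' }.
    const subMessages = {};
    for (const [, form, text] of forms.matchAll(/(\w+)\s*\{([^{}]*)\}/g)) {
      subMessages[form] = text;
    }

    // Pick the form for this counter; fall back to "other" if it is not defined.
    const form = pluralRules.select(counter);
    const chosen = form in subMessages ? subMessages[form] : subMessages.other;

    // "#" stands for the counter, formatted for the locale.
    return chosen.replace(/#/g, numberFormat.format(counter));
  });
}

const rabbitMessage = 'I have {rabbits, plural, one{# rabbit} other{# rabbits}}.';
console.log(formatPlural(rabbitMessage, { rabbits: 1 }, 'en-US'));    // prints "I have 1 rabbit."
console.log(formatPlural(rabbitMessage, { rabbits: 1000 }, 'en-US')); // prints "I have 1,000 rabbits."
```

A real implementation also needs a proper parser for nesting, escaping, and the other format types, which is a good reason to use an existing tool.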

Benefits of inlining

As you can see in the preceding message, we defined a plural format inline. If it weren’t inlined, we might need to repeat the words “I have…” for all plural forms, instead of just typing them once. Imagine if you needed to use even more words, as in the following example:

{
  one: 'My name is Emily and I got 1 like in my latest post.',
  other: 'My name is Emily and I got # likes in my latest post.'
}

Without inlining, we’d need to repeat “My name is Emily and I got…in my latest post” every single time. That’s a lot of words.

In contrast, inlining in ICU’s MessageFormat simplifies things. Instead of repeating the phrase for every plural form, all we need to do is localize the word “like”:

var message = 'My name is Emily and I got {likes, plural, one{# like} other{# likes}} in my latest post';

Here we don’t need to repeat the words “My name is Emily and I got…in my latest post” for every plural form. Instead, we can simply localize the word “like.”

Benefits of nesting messages

MessageFormat’s nested nature also lets us define far more complex strings. Here we nest a plural format inside a select format to demonstrate how flexible MessageFormat is:

var message = '{likeRange, select,\
  range1{I got no likes}\
  range2{I got {likes, plural, one{# like} other{# likes}}}\
  other{I got too many likes}\
}';

A select format matches its argument against a set of cases and outputs the sub-message of the case that matches, which makes it a good fit for range-based messages. In the preceding example, we want a different message for each of three like ranges. As you can see in range2, we defined a plural format to produce “I got X like[s],” and nested that plural format inside the select format. Few localization syntaxes can express this kind of combined formatting.

With the above format, here are the messages we can expect to get:

“I got no likes,” if likeRange is in range1.

“I got 1 like,” if likeRange is in range2 and the number of likes is 1.

“I got 10 likes,” if likeRange is in range2 and the number of likes is 10.

“I got too many likes,” if likeRange is in neither range1 nor range2.
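For comparison, here is a hand-written equivalent of that nested message in plain JavaScript, to show the order of evaluation: the range is selected first, and the plural form is chosen only inside range2. The article does not say where each range begins or ends, so the thresholds in the hypothetical `likeRangeFor` helper are assumptions.

```javascript
// Hypothetical thresholds: 0 likes is range1, 1 to 100 is range2, anything else is "other".
function likeRangeFor(likes) {
  if (likes === 0) return 'range1';
  if (likes <= 100) return 'range2';
  return 'other';
}

const englishRules = new Intl.PluralRules('en-US');
const likeForms = { one: 'like', other: 'likes' };

function likeMessage(likes) {
  switch (likeRangeFor(likes)) {
    case 'range1':
      return 'I got no likes';
    case 'range2':
      // The nested plural format: pick "like" or "likes" from the counter.
      return `I got ${likes} ${likeForms[englishRules.select(likes)]}`;
    default:
      return 'I got too many likes';
  }
}

console.log(likeMessage(0));    // prints "I got no likes"
console.log(likeMessage(1));    // prints "I got 1 like"
console.log(likeMessage(10));   // prints "I got 10 likes"
console.log(likeMessage(1000)); // prints "I got too many likes"
```

Everything in `likeMessage` would have to be rewritten per language, whereas the MessageFormat string keeps that logic in the translatable message itself.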

These are very hard concepts to localize—even one of the most popular internationalization tools, gettext, can’t do this.

Storage and pre-compiled messages

However, instead of storing MessageFormat messages in a JavaScript variable, you might want to use some kind of storage format, such as multiple JSON files. This will allow you to pre-compile the messages to simple localization getters. If you don’t want to handle this alone, you might try L10ns, which handles storage and pre-compilation for you, as well as syncing translation keys between source and storage.
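The idea of pre-compiling stored messages into getters can be sketched as follows. This is a deliberately simplified illustration, not how L10ns works: the JSON contents, the message keys, and the `compile` and `compileAll` helpers are all hypothetical, and the templates support only plain {name} placeholders rather than MessageFormat’s plural or select syntax.

```javascript
// Hypothetical contents of a per-locale JSON file, for example en-US.json.
const storedMessages = JSON.parse(
  '{"GREETING": "My name is {name}.", "LOCATION": "I live in {city}."}'
);

// Compile one template into a getter. The string is split once, up front,
// so each later call only joins the pre-split parts with the supplied values.
function compile(template) {
  const parts = template.split(/\{(\w+)\}/); // odd indexes hold placeholder names
  return (params = {}) =>
    parts.map((part, i) => (i % 2 === 1 ? String(params[part]) : part)).join('');
}

// Compile every stored message into a map of key -> getter function.
function compileAll(messages) {
  const getters = {};
  for (const [key, template] of Object.entries(messages)) {
    getters[key] = compile(template);
  }
  return getters;
}

const localize = compileAll(storedMessages);
console.log(localize.GREETING({ name: 'Emily' })); // prints "My name is Emily."
```

The parsing cost is paid once per message rather than on every render, and translators only ever touch the JSON files.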

Do translators need to know MessageFormat?

You might think it would be too overwhelming for non-programming translators to learn MessageFormat and CLDR’s plural forms. But in my experience, teaching them the basics of how the markup looks and what it does, and what CLDR’s plural forms are, takes just a few minutes and gives translators enough information to do their job using MessageFormat. L10ns’ web interface also displays example numbers for each CLDR plural form for easy reference.





Pluralization isn’t easy, but it’s worth it

Yes, pluralization has a lot of edge cases that aren’t easily solvable. But ICU’s MessageFormat has helped me tremendously in my work, giving me the flexibility I need to translate plural strings. As we move to a more connected world, localizing applications to more languages and regions is a must-do, and knowledge of general localization problems and the tools that solve them is a must-have. We need to localize apps because the world is more connected, but we can also localize apps to help make the world more connected.