In the internet age, when we face a language barrier, there are a host of internet resources to solve it: things like translation apps, dictionary websites, versions of Wikipedia in other languages, and the simple "click to translate" option. But there are about 7000 languages spoken in the world today. The top 10 or so are spoken by hundreds of millions of speakers; the bottom third have 1000 speakers or fewer.

Gretchen McCulloch is WIRED's resident linguist. She's the cocreator of Lingthusiasm, a podcast that's enthusiastic about linguistics. Her book Because Internet: Understanding the New Rules of Language is due out in July 2019 from Penguin.

But in the murky middle ground are a couple hundred languages that are each spoken by millions of people. These midsize languages are still fairly widely spoken, but they have vastly inconsistent levels of support online. There’s Swedish, which has 9.6 million speakers, the third-largest Wikipedia with over 3 million articles, and support in Google Translate, Bing Translate, Facebook, Siri, YouTube captions, and so on. But there’s also Odia, the official language of the Odisha state in India, with 38 million speakers, which has no presence in Google Translate. And Oromo, a language spoken by some 34 million people, mostly in Ethiopia, which has just 772 articles in its Wikipedia.

Why do Greek, Czech, Hungarian, and Swedish, with their 8 to 13 million speakers, have Google Translate support and robust Wikipedia presences, while languages the same size or larger, like Bhojpuri (51 million), Fula (24 million), Sylheti (11 million), Quechua (9 million), and Kirundi (9 million) languish in technological obscurity?

Part of the reason is that Greek, Czech, Hungarian, and Swedish are among the 24 official languages of the European Union, which means that a small horde of human translators translate many official European Parliament documents every year. Human-translated documents make a great base for what linguists call a parallel corpus: a large mass of text that's equivalent, sentence-by-sentence, in multiple languages. Machine translation engines use parallel corpora to figure out regular correspondences between languages: if "regering" or "κυβέρνηση" or "kormány" or "vláda" all frequently appear in parallel to "government," then the machine concludes these words are equivalent.
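To make the idea of "regular correspondences" concrete, here is a minimal sketch of how co-occurrence in a parallel corpus can point to word equivalents. It is an illustration only, not how Google Translate or any production engine works: real systems rely on far larger corpora and on statistical alignment models or neural networks. The four-sentence Swedish-English corpus, the function names `dice_scores` and `best_match`, and the choice of the Dice coefficient as the scoring measure are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus: each pair is a Swedish phrase and its English
# equivalent. A real parallel corpus would hold millions of sentence pairs.
PARALLEL_CORPUS = [
    ("ny regering", "new government"),
    ("ny lag", "new law"),
    ("svensk regering", "swedish government"),
    ("svensk lag", "swedish law"),
]


def dice_scores(corpus):
    """Score every (source word, target word) pair by how often they co-occur.

    Uses the Dice coefficient: 2 * co-occurrences / (source count + target count),
    where all counts are numbers of sentence pairs.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in corpus:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word in the same pair.
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_match(word, scores):
    """Return the target word with the highest score, or None if the word is unseen."""
    candidates = {t: score for (s, t), score in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


scores = dice_scores(PARALLEL_CORPUS)
print(best_match("regering", scores))
```

In this toy corpus "regering" appears alongside "government" every time, but alongside "new" and "swedish" only half the time, so "government" scores highest. With only a handful of sentences such a signal is fragile, which is one way to see why the size and variety of the corpus matter so much.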

In order to be reasonably effective, machine translation requires an enormous parallel corpus for each language. Ideally, this corpus contains documents from a variety of genres: not just parliamentary proceedings but news reports, novels, film scripts, and so on. The machine can't translate informal social media posts very well if it's been trained only on formal legal documents. Translation tools are already scraping the bottom of the parallel corpus barrel: In many languages, the largest parallel translated text is the Bible, which leads to peculiar circumstances where Google translates nonsense syllables into prophecies of doom.

In addition to EU documents, Swedish, Greek, Hungarian, and Czech have a wealth of language resources, created one human at a time over centuries. They're the languages of entire nation-states, with national TV and radio recordings that can be used as the foundation for text-to-speech models. Their speakers have the kind of disposable income that makes media companies translate popular novels and subtitle foreign movies and TV shows. They're found in countries that tech companies imagine their customers might be living in or might at least visit on holiday, meaning it's worth localizing interfaces and adding them as translation options. They have regularized spelling systems and dictionaries that can be rolled into spellcheckers and predictive text models. They have highly literate speakers with internet access who can contribute to projects like Wikipedia. (Speakers who can even, in the case of Swedish, create a bot to automatically make basic Wikipedia articles for rivers, mountains, and other natural features.)

Language resources don't just appear. People have to decide to create them, and those people need to be fed and watered and educated and housed and supported, whether that's by governments or by companies or by the kind of personal wealth that lets individuals take on time-consuming intellectual hobbies. Creating parallel corpora and other language resources takes years, if it happens at all, and costs tens of millions of dollars per language.

Meanwhile, we know that catastrophes periodically happen around the world: earthquakes, floods, hurricanes, cyclones, diseases, famines, fires. Some of them will happen in areas where people speak a large, well-resourced language, and organizations will rush to their aid. But the odds are good that some of the world's future crises will happen in areas where people speak one of these medium-size but low-resource languages. In those cases, aid organizations and governments will face an urgent language barrier.