One of my favorite activities, when I travel, is to listen to people as they pass and try to guess what language they’re speaking. I’d like to think that I’ve gotten pretty good at it over the years (though I rarely get to know if I guessed right).

If I’m lucky, I’ll recognize a word or phrase as a cognate of a language I’m familiar with, and narrow things down from there. Otherwise, I try to build up a phonetic inventory, listening for what kinds of sounds are present. For instance, is the speaker mostly using voiced alveolar trills ⟨r⟩, flaps ⟨ɾ⟩, or postalveolar approximants ⟨ɹ⟩? Are the vowels mostly open / close; front / back? Any unusual sounds, like ⟨ʇ⟩?

…or at least that’s what I think I do. To be honest, all of this happens unconsciously and automatically – for all of us, and for all manner of language recognition tasks. And we have only the faintest idea of how we get from input to output.

Computers operate in a similar manner. After many hours of training, machine learning models can predict the language of text with accuracy far exceeding previous attempts from a formalized top-down approach.

Machine learning has been at the heart of natural language processing on Apple platforms for many years, but it’s only recently that external developers have been able to harness it directly.

New in iOS 12 and macOS 10.14, the Natural Language framework refines existing linguistic APIs and exposes new functionality to developers.

NLTagger is NSLinguisticTagger with a new attitude. NLTokenizer is a replacement for enumerateSubstrings(in:options:using:) (née CFStringTokenizer). NLLanguageRecognizer extends the functionality previously exposed through the dominantLanguage property of NSLinguisticTagger, with the ability to provide hints and get additional predictions.

Recognizing the Language of Natural Language Text

Here’s how to use NLLanguageRecognizer to guess the dominant language of natural language text:

    import NaturalLanguage

    let string = """
    私はガラスを食べられます。それは私を傷つけません。
    """

    let recognizer = NLLanguageRecognizer()
    recognizer.processString(string)
    recognizer.dominantLanguage // ja

First, create an instance of NLLanguageRecognizer and call the method processString(_:), passing a string. From there, the dominantLanguage property returns an NLLanguage object containing the BCP 47 language tag of the predicted language (for example, "ja" for 日本語 / Japanese).
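Because dominantLanguage is optional, it's worth handling the case where no prediction comes back rather than force-unwrapping. The sketch below wraps the lookup in a small helper; the name dominantLanguageTag(of:) is hypothetical and not part of the framework. It uses the class method NLLanguageRecognizer.dominantLanguage(for:), a shorthand for the create / process / read sequence described above.

```swift
#if canImport(NaturalLanguage)
import NaturalLanguage

/// Hypothetical helper (not part of the framework): returns the BCP 47 tag
/// of the dominant language, or nil if the recognizer makes no prediction.
func dominantLanguageTag(of text: String) -> String? {
    // Class-method shorthand for creating a recognizer, calling
    // processString(_:), and reading dominantLanguage.
    return NLLanguageRecognizer.dominantLanguage(for: text)?.rawValue
}

if let tag = dominantLanguageTag(of: "私はガラスを食べられます。それは私を傷つけません。") {
    print("Detected language: \(tag)")
} else {
    print("Couldn't determine the language")
}
#endif
```

The #if canImport guard only keeps the snippet from failing to compile on platforms without the Natural Language framework; in an app targeting iOS 12 or macOS 10.14 and later, you can drop it.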

Getting Multiple Language Hypotheses

If you studied linguistics in college or joined the Latin club in high school, you may be familiar with some fun examples of polylingual homonymy between dialectic Latin and modern Italian.

For example, consider the readings of the following sentence:

CANE NERO MAGNA BELLA PERSICA!

| Language | Translation                             |
|----------|-----------------------------------------|
| Latin    | Sing, o Nero, the great Persian wars!   |
| Italian  | The black dog eats a nice peach!        |

To the chagrin of Max Fisher, Latin isn’t one of the languages supported by NLLanguageRecognizer, so any examples of confusable languages won’t be nearly as entertaining.

With some experimentation, you’ll find that it’s quite difficult to get NLLanguageRecognizer to guess incorrectly, or even with low confidence. Beyond giving it a single cognate shared across members of a language family, it’s often able to get past 2σ to 95% certainty with a handful of words.

After some trial and error, we were finally able to get NLLanguageRecognizer to guess incorrectly for a string of non-trivial length by passing Article I of the Universal Declaration of Human Rights in Norwegian Bokmål (norsk bokmål):

    let string = """
    Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter. De er utstyrt med fornuft og samvittighet og bør handle mot hverandre i brorskapets ånd.
    """

    let languageRecognizer = NLLanguageRecognizer()
    languageRecognizer.processString(string)
    languageRecognizer.dominantLanguage // da (!)

The Universal Declaration of Human Rights is among the most widely translated documents in the world, with translations in over 500 different languages. For this reason, it’s often used for natural language tasks.

Danish and Norwegian Bokmål are very similar languages to begin with, so it’s unsurprising that NLLanguageRecognizer guessed incorrectly. (For comparison, see the equivalent text in Danish.)

We can use the languageHypotheses(withMaximum:) method to get a sense of how confident the dominantLanguage guess was:

    languageRecognizer.languageHypotheses(withMaximum: 2)

| Language              | Confidence |
|-----------------------|------------|
| Danish (da)           | 56%        |
| Norwegian Bokmål (nb) | 43%        |
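The hypotheses come back as a dictionary keyed by NLLanguage, and dictionaries have no defined order, so you'll typically want to sort them before displaying them. Here's a minimal sketch of one way to do that; the ranked(_:) helper is a hypothetical name introduced for illustration, not part of the framework.

```swift
import Foundation
#if canImport(NaturalLanguage)
import NaturalLanguage
#endif

/// Hypothetical helper: orders a hypothesis dictionary from most to
/// least likely. Generic so it works with NLLanguage or plain strings.
func ranked<Key: Hashable>(_ hypotheses: [Key: Double]) -> [(key: Key, confidence: Double)] {
    return hypotheses
        .sorted { $0.value > $1.value }
        .map { (key: $0.key, confidence: $0.value) }
}

#if canImport(NaturalLanguage)
let hypothesisRecognizer = NLLanguageRecognizer()
hypothesisRecognizer.processString(
    "Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter."
)

// Print each candidate language with its confidence as a percentage.
for (language, confidence) in ranked(hypothesisRecognizer.languageHypotheses(withMaximum: 3)) {
    print(language.rawValue, String(format: "%.0f%%", confidence * 100))
}
#endif
```

The exact confidence values you see may differ from the table above, since they depend on the input text and the OS version.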

At the time of writing, the languageHints property is undocumented, so it’s unclear how exactly it should be used. However, passing a weighted dictionary of probabilities seems to have the desired effect of bolstering the hypotheses with known priors:

    languageRecognizer.languageHints = [.danish: 0.25, .norwegian: 0.75]

| Language              | Confidence (with Hints) |
|-----------------------|-------------------------|
| Danish (da)           | 30%                     |
| Norwegian Bokmål (nb) | 70%                     |
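If your priors start out as raw weights (say, how often you've seen each language from a given user), one option is to scale them so they sum to 1 before assigning them as hints. This is only a sketch: normalized(_:) is a hypothetical helper, and whether the framework actually requires hints to sum to 1 is an assumption, not something confirmed here.

```swift
#if canImport(NaturalLanguage)
import NaturalLanguage
#endif

/// Hypothetical helper: scales non-negative weights so they sum to 1.
/// Returns an empty dictionary if the weights add up to zero or less.
func normalized<Key: Hashable>(_ weights: [Key: Double]) -> [Key: Double] {
    let total = weights.values.reduce(0, +)
    guard total > 0 else { return [:] }
    return weights.mapValues { $0 / total }
}

#if canImport(NaturalLanguage)
// Assumed prior: Norwegian is three times as likely as Danish.
let rawWeights: [NLLanguage: Double] = [.danish: 1, .norwegian: 3]

let hintedRecognizer = NLLanguageRecognizer()
hintedRecognizer.languageHints = normalized(rawWeights) // [.danish: 0.25, .norwegian: 0.75]
hintedRecognizer.processString(
    "Alle mennesker er født frie og med samme menneskeverd og menneskerettigheter."
)
print(hintedRecognizer.languageHypotheses(withMaximum: 2))
#endif
```

Setting the hints before calling processString(_:) is a cautious ordering choice here, not a documented requirement.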

So what can you do once you know the language of a string?

Here are a couple of use cases for your consideration:

Checking Misspelled Words

Combine NLLanguageRecognizer with UITextChecker to check the spelling of words in any string:

Start by creating an NLLanguageRecognizer and initializing it with a string by calling the processString(_:) method:

    let string = """
    Wenn ist das Nunstück git und Slotermeyer? Ja! Beiherhund das Oder die Flipperwaldt gersput!
    """

    let languageRecognizer = NLLanguageRecognizer()
    languageRecognizer.processString(string)
    let dominantLanguage = languageRecognizer.dominantLanguage! // de

Then, pass the rawValue of the NLLanguage object returned by the dominantLanguage property to the language parameter of rangeOfMisspelledWord(in:range:startingAt:wrap:language:):

    import UIKit

    let textChecker = UITextChecker()
    let nsString = NSString(string: string)
    let stringRange = NSRange(location: 0, length: nsString.length)

    // Walk the string, resuming each search where the last misspelling ended.
    var offset = 0
    repeat {
        let wordRange = textChecker.rangeOfMisspelledWord(
            in: string,
            range: stringRange,
            startingAt: offset,
            wrap: false,
            language: dominantLanguage.rawValue
        )

        // NSNotFound means there are no more misspelled words.
        guard wordRange.location != NSNotFound else {
            break
        }

        print(nsString.substring(with: wordRange))
        offset = wordRange.upperBound
    } while true

When passed The Funniest Joke in the World, the following words are called out for being misspelled:

Nunstück

Slotermeyer

Beiherhund

Flipperwaldt

gersput

Synthesizing Speech

You can use NLLanguageRecognizer in concert with AVSpeechSynthesizer to hear any natural language text read aloud:

    import AVFoundation
    import NaturalLanguage

    let string = """
    Je m'baladais sur l'avenue le cœur ouvert à l'inconnu
    J'avais envie de dire bonjour à n'importe qui.
    N'importe qui et ce fut toi, je t'ai dit n'importe quoi
    Il suffisait de te parler, pour t'apprivoiser.
    """

    let languageRecognizer = NLLanguageRecognizer()
    languageRecognizer.processString(string)
    let language = languageRecognizer.dominantLanguage!.rawValue // fr

    // Pick a voice matching the detected language and speak the text.
    let speechSynthesizer = AVSpeechSynthesizer()
    let utterance = AVSpeechUtterance(string: string)
    utterance.voice = AVSpeechSynthesisVoice(language: language)
    speechSynthesizer.speak(utterance)

It doesn’t have the lyrical finesse of Joe Dassin, but ainsi va la vie.

In order to be understood, we first must seek to understand. And the first step to understanding natural language is to determine its language.