Certain English Errors May Decipher Clues To Dying Languages

Linguists try to understand the nuances of languages and how they relate to one another. A computer scientist says the English mistakes of non-native speakers can reveal properties of the speakers' native languages.

STEVE INSKEEP, HOST:

Anything you write or say has two parts to it - what you say and how you say it. Computers can now pay attention to how we express ourselves. A computer scientist says software can use that information to figure out something about the way our brains process language, which could lead to an even deeper knowledge, as we're about to learn from NPR's social science correspondent Shankar Vedantam. Hi, Shankar.

SHANKAR VEDANTAM, BYLINE: Good morning, Steve.

INSKEEP: OK, so what's going on here?

VEDANTAM: Well, I spoke with a computer scientist at MIT. His name is Boris Katz. He's from Russia, and he noticed that he and other native Russian speakers make similar mistakes when they speak in English. And the reason, of course, is that the rules of the Russian language have been burned into the brains of native Russian speakers, and those patterns are surfacing when they speak in English. Katz explained this idea with an example.

BORIS KATZ: In Russian language, there are no articles. These are the words like "the" and "a" and so forth. And as a result, native speakers of Russian would say something like, "I was champion of swimming competition in Russia," instead of, "I was the champion of a swimming competition in Russia."

INSKEEP: OK, I get this. He's saying that people who learned a different language first will make the same mistakes when they move over into English.

VEDANTAM: That's exactly right. Now, Katz asked a simple question: If native speakers of a foreign language make similar mistakes when they speak in English, can you deduce the properties of that foreign language by analyzing the mistakes they make when they speak in English? So Katz, Yevgeni Berzak and Roi Reichart analyzed thousands of English-language essays by native speakers of 14 languages. Actually they didn't analyze the essays; they programmed a computer to analyze the essays. They told the computer the native language of the writers, and the computer quickly made connections. For example, the Russian speakers tended to omit articles in their English essays. So without knowing anything about these other languages, the computer was now able to say, these are some of the characteristics of this foreign language.
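To make the idea of linking a writer's native language to a pattern in their English concrete, here is a minimal Python sketch. It is not the researchers' method, which the broadcast does not describe in detail. It only measures one signal mentioned above, how often articles appear, and averages it per native language. The function names and the two sample essays are hypothetical and made up for illustration.

```python
import re
from collections import defaultdict

# English articles, the words Katz says Russian lacks.
ARTICLES = {"a", "an", "the"}

def article_rate(text):
    """Return the fraction of word tokens in text that are articles."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in ARTICLES for w in words) / len(words)

def mean_article_rate_by_language(essays):
    """Average the article rate over essays grouped by native language.

    essays is an iterable of (native_language, essay_text) pairs.
    """
    rates = defaultdict(list)
    for language, text in essays:
        rates[language].append(article_rate(text))
    return {lang: sum(vals) / len(vals) for lang, vals in rates.items()}

# Toy, made-up data (assumption): one sentence per "essay".
essays = [
    ("Russian", "I was champion of swimming competition in Russia."),
    ("French", "I was the champion of a swimming competition in France."),
]
profile = mean_article_rate_by_language(essays)
```

In this toy data the Russian-labeled text has an article rate of zero, while the French-labeled text does not. A real system would presumably combine many such signals over thousands of essays rather than rely on a single one.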

INSKEEP: I guess this must get to the point then where you can put in an English-language essay of someone who's not a native English speaker with no information and figure out what the other language was.

VEDANTAM: That's exactly right, Steve, and initially Katz thought this would be a way to allow computers to detect the native languages of different speakers. So once the computer learns enough about the characteristics of Russian, you can have the computer analyze Facebook posts, for example, and the computer would say, the writer of this is probably a native Russian speaker.

INSKEEP: Figure out my immigration history, for example.

VEDANTAM: Exactly. But when he had the computer actually do this, Katz found something interesting. The computer was making mistakes. It would say this person's language is Russian, when really their native language was Polish. At first Katz thought that this was a bug, and then he realized this might actually be a feature because the languages that the computer was mixing up, these were languages that had very similar structures and rules of grammar, and that's precisely why the computer was mixing them up. And in fact the errors the computer was making were effectively grouping together similar languages. And if you organize the languages the computer mixed up into a tree, this tree looked identical to the tree that linguists have been constructing over many decades.

KATZ: I literally couldn't believe my eyes. The tree very neatly placed Russian and Polish on the same branch and French and Italian on another branch and Spanish and Portuguese on yet another. You know, Russian and Polish are Slavic languages, Spanish and Portuguese are Romance and so on.
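The step from the computer's mix-ups to a tree can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure the researchers used: it treats the number of times two languages are confused with each other as a similarity score and repeatedly merges the most-confused pair (greedy agglomerative clustering with average linkage). The function names and all counts are hypothetical and made up.

```python
def confusion_similarity(confusion, a, b):
    """How often a and b are mistaken for each other, in either direction."""
    return confusion[a].get(b, 0) + confusion[b].get(a, 0)

def build_tree(confusion):
    """Merge the most-confused clusters first; return a tree of nested tuples."""
    # Each cluster is (subtree, list of member languages).
    clusters = [(lang, [lang]) for lang in sorted(confusion)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                members_i, members_j = clusters[i][1], clusters[j][1]
                total = sum(
                    confusion_similarity(confusion, a, b)
                    for a in members_i
                    for b in members_j
                )
                # Average linkage: mean similarity over all cross-cluster pairs.
                score = total / (len(members_i) * len(members_j))
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Made-up counts (assumption): confusion[true][guessed] = number of essays.
confusion = {
    "Russian": {"Polish": 30, "Spanish": 2, "Portuguese": 1},
    "Polish": {"Russian": 25, "Spanish": 1, "Portuguese": 2},
    "Spanish": {"Portuguese": 28, "Russian": 1, "Polish": 2},
    "Portuguese": {"Spanish": 33, "Russian": 2, "Polish": 1},
}
tree = build_tree(confusion)
```

With these invented counts, the result pairs Portuguese with Spanish and Polish with Russian before joining the two pairs, which mirrors the branches Katz describes.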

INSKEEP: OK, kind of cool here, but this is something that people already understood. Is there something more you can learn from this?

VEDANTAM: Yes, I think so because at one level, the computer is verifying what it is that linguists have already found. And that's interesting and it's cool, but it's not particularly earth-shattering. But this also raises the possibility the computer can actually start to make discoveries about languages. So by analyzing the patterns of mistakes that native speakers of two languages make in English, the computer can say, look, these two languages might actually be related to one another because the structures of these languages are actually similar.

This might be especially powerful for languages that are on the brink of extinction. Many of the world's languages are disappearing, and it's hard for linguists to study these languages because there are very few native speakers left or maybe those native speakers have themselves forgotten this distant language. But if those dying languages have left traces in the brains of some of those speakers and those traces show up in the mistakes those speakers make when they're speaking and writing in English, we can use the errors to learn something about those disappearing languages. So in effect, you're using the way people speak English as an archaeological window into other languages and the history of language.

INSKEEP: Shankar, thanks.

VEDANTAM: Thank you, Steve.

INSKEEP: No doubt somewhere in my brain, there are traces of the Latin I learned in Mrs. Whittaker's (ph) eighth grade class. NPR social science correspondent Shankar Vedantam. You can follow him on Twitter at @HiddenBrain, and you can follow this program at @MorningEdition, also at @nprAudie and at @NPRinskeep.

Copyright © 2014 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.
