In the fall of 2017, Sam Bowman, a computational linguist at New York University, figured that computers still weren’t very good at understanding the written word. Sure, they had become decent at simulating that understanding in certain narrow domains, like automatic translation or sentiment analysis (for example, determining if a sentence sounds “mean or nice,” he said). But Bowman wanted measurable evidence of the genuine article: bona fide, human-style reading comprehension in English. So he came up with a test.

Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research develop­ments and trends in mathe­matics and the physical and life sciences.

In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.

The machines bombed. Even state-of-the-art neural networks scored no higher than 69 out of 100 across all nine tasks: a D-plus, in letter grade terms. Bowman and his coauthors weren’t surprised. Neural networks — layers of computational connections built in a crude approximation of how neurons communicate within mammalian brains — had shown promise in the field of “natural language processing” (NLP), but the researchers weren’t convinced that these systems were learning anything substantial about language itself. And GLUE seemed to prove it. “These early results indicate that solving GLUE is beyond the capabilities of current models and methods,” Bowman and his coauthors wrote.

Their appraisal would be short-lived. In October of 2018, Google introduced a new method nicknamed BERT (Bidirectional Encoder Representations from Transformers). It produced a GLUE score of 80.5. On this brand-new benchmark designed to measure machines’ real understanding of natural language — or to expose their lack thereof — the machines had jumped from a D-plus to a B-minus in just six months.

LEARN MORE The WIRED Guide to Artificial Intelligence

“That was definitely the ‘oh, crap’ moment,” Bowman recalled, using a more colorful interjection. “The general reaction in the field was incredulity. BERT was getting numbers on many of the tasks that were close to what we thought would be the limit of how well you could do.” Indeed, GLUE didn’t even bother to include human baseline scores before BERT; by the time Bowman and one of his Ph.D. students added them to GLUE in February 2019, they lasted just a few months before a BERT-based system from Microsoft beat them.

As of this writing, nearly every position on the GLUE leaderboard is occupied by a system that incorporates, extends or optimizes BERT. Five of these systems outrank human performance.

But is AI actually starting to understand our language — or is it just getting better at gaming our systems? As BERT-based neural networks have taken benchmarks like GLUE by storm, new evaluation methods have emerged that seem to paint these powerful NLP systems as computational versions of Clever Hans, the early 20th-century horse who seemed smart enough to do arithmetic, but who was actually just following unconscious cues from his trainer.

“We know we’re somewhere in the gray area between solving language in a very boring, narrow sense, and solving AI,” Bowman said. “The general reaction of the field was: Why did this happen? What does this mean? What do we do now?”

Writing Their Own Rules

In the famous Chinese Room thought experiment, a non-Chinese-speaking person sits in a room furnished with many rulebooks. Taken together, these rulebooks perfectly specify how to take any incoming sequence of Chinese symbols and craft an appropriate response. A person outside slips questions written in Chinese under the door. The person inside consults the rulebooks, then sends back perfectly coherent answers in Chinese.

The thought experiment has been used to argue that, no matter how it might appear from the outside, the person inside the room can’t be said to have any true understanding of Chinese. Still, even a simulacrum of understanding has been a good enough goal for natural language processing.