Microsoft and Alibaba have developed deep neural network models that scored higher than humans on a Stanford University reading-comprehension test, the Stanford Question Answering Dataset (SQuAD).

Microsoft achieved 82.650 on the ExactMatch (EM) metric* on Jan. 3, and Alibaba Group Holding Ltd. scored 82.440 on Jan. 5. The best human score so far is 82.304.

“SQuAD is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage,” according to the Stanford NLP Group. “With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.”

“A strong start to 2018 with the first model (SLQA+) to exceed human-level performance on @stanfordnlp SQuAD’s EM metric!” said Pranav Rajpurkar, a Ph.D. student in the Stanford Machine Learning Group and lead author of a paper on SQuAD in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (available on open-access arXiv). “Next challenge: the F1 metric*, where humans still lead by ~2.5 points!” (Alibaba’s SLQA+ scored 88.607 on the F1 metric and Microsoft’s r-net+ scored 88.493.)

However, challenging the “comprehension” description, Gary Marcus, Ph.D., a professor of psychology and neural science at NYU, notes in a tweet that “the SQUAD test shows that machines can highlight relevant passages in text, not that they understand those passages.”

“The Chinese e-commerce titan has joined the likes of Tencent Holdings Ltd. and Baidu Inc. in a race to develop AI that can enrich social media feeds, target ads and services or even aid in autonomous driving,” Bloomberg notes. “Beijing has endorsed the technology in a national-level plan that calls for the country to become the industry leader [by] 2030.”

Read more: China’s Plan for World Domination in AI (Bloomberg)

*“The ExactMatch metric measures the percentage of predictions that match any one of the ground truth answers exactly. The F1 score metric measures the average overlap between the prediction and ground truth answer.” – Pranav Rajpurkar et al., arXiv
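
For readers who want to see how the two metrics in the footnote differ, here is a minimal Python sketch. It is not the official SQuAD evaluation script. The function names, the text normalization (lowercasing, dropping punctuation and English articles) and the choice to take the best F1 across several reference answers are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this normalization is illustrative, not taken from the article.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truths):
    """Return 1.0 if the prediction equals any reference answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(t) for t in truths))


def f1(prediction, truths):
    """Return the best token-overlap F1 between the prediction and any reference."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for truth in truths:
        truth_tokens = normalize(truth).split()
        # Count tokens shared by prediction and reference (with multiplicity).
        overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(truth_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def dataset_scores(predictions, references):
    """Average both metrics over all questions and express them as percentages."""
    n = len(predictions)
    em = 100.0 * sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1_avg = 100.0 * sum(f1(p, r) for p, r in zip(predictions, references)) / n
    return em, f1_avg
```

Under these assumptions, a prediction such as "Eiffel Tower in Paris" against the reference "Eiffel Tower" earns no ExactMatch credit but still receives partial F1 credit for the overlapping words. That is why a system's F1 score is normally higher than its EM score.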