Can AI master eighth-grade science?

By Oren Etzioni and Carissa Schoenick. Allen Institute for AI

Just how intelligent is today’s artificial intelligence? Interestingly enough, the answer may lie in standardized tests such as the SAT, IQ tests, and science exams. Standardized tests with multiple-choice questions, graphs, and mathematical elements offer a strong basis for quantitatively assessing intelligence, whether human or artificial.

This fall, we launched the Allen AI Science Challenge, which invites participants to build software that can automatically answer eighth-grade multiple-choice science questions, with the goal of assessing the state of the art in natural language understanding and reasoning.

Everyone knows computers are in the business of facts: databases of information of every imaginable type have been built, analyzed, and shared across the world for decades. But pose a simple multiple-choice science question to a computer:

Which part of the eye does light hit first?

(A) the retina

(B) the lens

(C) the cornea

(D) the pupil

...and what do you get? Even for a system trained to understand the structure of a question like this, the answer is deceptively difficult to produce.

The complexity in this example lies in the nuance of language. What does it mean for light to “hit” something? How do you apply the notion of “first” to an eyeball? Whose eye are we talking about here, anyway? Building an AI system to successfully tackle questions like these, and therefore achieve what we might call a “true understanding” of basic science, requires more than just a database of hard facts. It will also require a way to represent elements of the unstated, common-sense knowledge about the world and the way humans experience it. This knowledge is typically generated through experience over a lifetime, and it provides the essential background context needed to successfully understand and answer this question and others like it.

Measuring AI: Why science exam questions?

The famous Turing test for AI proposes that if a system appears to exhibit intelligent behavior indistinguishable from that of a human during a natural-language conversation, it could be considered truly “artificially intelligent.” This approach is easy to game, however, and in dire need of revisiting. John Markoff noted that “the Turing test has become a test of human gullibility,” and Gary Marcus elaborates on how the Turing test fails to assess whether an AI can understand human knowledge in his interview with Arun Rath, “Moving Beyond the Turing Test to Judge Artificial Intelligence.” The “Beyond the Turing Test” workshop held at the AAAI conference in January of 2015 also began gathering community input on replacement tests that would better assess the success of a given AI system.

Unlike the Turing test, science test questions represent a more accessible, measurable, and widely understandable benchmark to use in AI development. Posing questions like these to an artificially intelligent system provides a way to directly measure that system’s ability to understand questions and reason about basic scientific concepts, as well as a way to compare its abilities with those of a human far more objectively than the Turing test allows.

The Allen AI Science Challenge

The Allen Institute for Artificial Intelligence (AI2) is dedicated to the mission of AI for the common good: building and sharing resources and tools with the wider community to help advance the field of AI in several important areas. To encourage the community to think about new ways to measure the advancement of AI, the team designed and launched a competition on Kaggle.com in October of 2015. The Allen AI Science Challenge invites participants to try their hand at building a model that can correctly answer eighth-grade-level multiple-choice science questions, just like the example above.

Teams can use any strategy they like so long as the resources or tools they incorporate into their systems are freely available and open source. The final winning models will be open source and available to the research community. There are prizes for the top three teams, with a first prize of $50,000 for the model that answers the highest percentage of questions correctly. As of early January, 7702 unique users have downloaded the training set and 577 teams have scores on the leaderboard. Scores on the validation question set have climbed steadily, recently surpassing 57%: Current AI Challenge Leaderboard.
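The scoring rule described above, the percentage of questions answered correctly, can be sketched in a few lines of Python. This is only an illustration, not the competition’s actual evaluation code: the names Question, accuracy, and always_a are hypothetical, and the answer key for the eye question (C, the cornea) is supplied here rather than taken from the challenge data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Question:
    """One multiple-choice question (hypothetical structure, for illustration)."""
    text: str
    choices: Dict[str, str]  # answer label -> answer text
    answer: str              # label of the correct choice


def accuracy(questions: List[Question],
             predict: Callable[[Question], str]) -> float:
    """Return the fraction of questions for which predict() picks the right label."""
    if not questions:
        raise ValueError("no questions to score")
    correct = sum(1 for q in questions if predict(q) == q.answer)
    return correct / len(questions)


# The example question from this article; the answer key is our own addition.
eye = Question(
    text="Which part of the eye does light hit first?",
    choices={
        "A": "the retina",
        "B": "the lens",
        "C": "the cornea",
        "D": "the pupil",
    },
    answer="C",
)


def always_a(question: Question) -> str:
    """A trivial baseline 'model' that ignores the question and answers A."""
    return "A"


if __name__ == "__main__":
    score = accuracy([eye], always_a)
    print(f"always-A baseline: {score:.0%}")
```

Assuming every question has four choices, as in the example, a model that guesses uniformly at random would be expected to score about 25%, which gives some context for the leaderboard scores quoted above.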

This challenge is unique among Kaggle competitions, which typically focus on applying standard machine learning techniques to predict variables in large data sets. This competition asks the AI community to dig deeper and put together strategies that take advantage of knowledge and reasoning to select the right answer to a given question. The community’s hard work on this problem will serve as a clear demonstration of how far AI has come in understanding and answering science questions, and will bring us closer to figuring out just what it’s going to take for AI to master eighth-grade science and beyond.

The Allen AI Science Challenge will conclude in early February, and the winner will be announced at this year’s AAAI Conference in Phoenix on February 16th, 2016.

What’s next?

AI2 is interested in continuing to develop better ways to measure true progress in the field of artificial intelligence. This means designing tests that are more objective, more understandable, and more applicable to the global challenges we face. Let us know about ideas you might have for what the next Allen AI Challenge should be!



Follow AI2 on Twitter at @allenai_org and on Facebook at facebook.com/alleninstituteforai.