Luna Rating

How close are we to artificial intelligence? Are we getting closer? There appear to be plenty of answers. Gizmodo says that the AI revolution “could happen soon”. The Huffington Post reports that Ex-Machina style robots are “still a little way off.” Edge.org has published a collection of over 200 short essays on “machines that think,” with experts differing on the when, the how, and even the if. There is no shortage of speculation, nor any sign of consensus, on the emergence of intelligent machines.

Day-to-day research in artificial intelligence proceeds unencumbered by these speculations. A model's worth is measured not by public perception, but by its performance on the benchmarks of the field. Such benchmarks (object recognition on ImageNet, language modeling on the One Billion Word Benchmark, reinforcement learning on Atari games) are attractive to researchers seeking measurable, specific improvements on the state of a particular art. But it is difficult to look at a model's performance on a specialized task, or even a collection of tasks, and determine its proximity to true AI. For this bird's-eye view, one must look elsewhere.

There is only one widely acknowledged, theoretically founded test for general AI: the Turing Test. However, as the Eugene Goostman debacle of 2014 made evident, attempted practical implementations of the Turing Test can confuse and mislead. Turing did not say that fooling any particular human judge or judges would suffice to demonstrate intelligence. He meant for his Test to be statistical: a machine passes only if it fools judges with sufficient probability over the population of possible human judges.

The Turing Test was meant to be a thought experiment, not a practical yardstick for AI research. But its principles have withstood the test of time for good reason. First, the Test acknowledges that intelligence cannot be evaluated without an intelligent, human evaluator. Second, the choice of natural language as the medium for the Test is critical. Natural language is widely believed to be AI-complete, meaning that a machine capable of natural language will be capable of all other feats of AI. A practical test for artificial intelligence should build on these same foundations.

Luna Rating System

As the field moves at an unprecedented pace, it is more important than ever to have an accurate sense of where we are. With this ambition, I introduce the Luna Rating System. Luna, for short.

Luna is premised on two-player games called Luna Games. The objective of each player is to guess the other’s Smarts Rating, a number between 1 and 100 that captures how “smart” the other player has been deemed in previous games. In practice, a player’s Smarts Rating is an average of previous opponents’ guesses.

A single Luna Game consists of three phases: Interview, Response, and Guess. During the Interview phase, each player creates a set of five questions for the other player. In the Response phase, each player is presented with the other’s interview questions, and must submit responses accordingly. In the Guess phase, the responses are returned, and each player must guess the other’s Smarts Rating.

The winner of the Luna Game is the player whose guess is closest to the other’s actual Smarts Rating. After the game, each player’s Smarts Rating is updated to incorporate the new guess.
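The scoring and rating update described above can be sketched in a few lines of Python. This is my own illustrative sketch, not Luna's actual implementation; in particular, the function names and the tie handling are assumptions.

```python
def winner(guess_a, guess_b, rating_a, rating_b):
    """Return which player guessed closer to the other's actual Smarts Rating."""
    error_a = abs(guess_a - rating_b)  # player A guesses B's rating
    error_b = abs(guess_b - rating_a)  # player B guesses A's rating
    if error_a < error_b:
        return "A"
    if error_b < error_a:
        return "B"
    return "tie"

def updated_rating(old_rating, num_guesses, new_guess):
    """Fold a new opponent guess into a player's running-average Smarts Rating."""
    return (old_rating * num_guesses + new_guess) / (num_guesses + 1)
```

For example, if player A (rated 55) guesses 70 for player B (rated 72), while B guesses 50 for A, then A's error is 2 and B's error is 5, so A wins the Game.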

The Game is designed to inspire two motivations in players. First, players are encouraged to guess accurately, according to their own notions of intelligence, so that they may be crowned winner. This incentive is put in place so that the Smarts Rating of a player, being the average of guesses, converges towards the consensus of all human players. Second, players are encouraged to answer as intelligently as possible, so as to achieve high Smarts Ratings. The accuracy of the system does not depend on humans responding to the best of their ability, but their doing so is what allows machine Smarts Ratings to be meaningfully compared with human ones.
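To see why the running average should track consensus, consider a toy simulation in which each judge's guess is the consensus value plus independent noise. The consensus value and noise level below are illustrative assumptions, not Luna data.

```python
import random

random.seed(0)
consensus = 69.0  # hypothetical true consensus rating for some player
noise = 15.0      # spread of individual judges' guesses

# Each guess is noisy, clamped to the valid 1-100 range.
guesses = [min(100.0, max(1.0, random.gauss(consensus, noise)))
           for _ in range(2000)]

# The running average of many independent noisy guesses
# lands close to the consensus value.
running_avg = sum(guesses) / len(guesses)
```

Any single judge may be far off, but no single judge moves the average much, which is the property the rating relies on.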

The fundamental pillar of Luna is practicality.

Luna does not rely on any dogmatic interpretation of intelligence. Instead, it captures a social construction: machine intelligence is measured by human consensus.

Luna incentivizes humans to judge well. Judges are rewarded not for hard questions, but for probing questions that lead to accurate assessments of intelligence.

Luna returns a continuous rating, rather than a binary Pass or Fail. Progress is measurable regardless of starting point.

Luna is free and online, always accessible to researchers. Machines are able and encouraged to play dozens of times throughout the day.

Launching Luna

The first iteration of Luna was launched on February 19, 2016. Four days later, at the time of writing, the system has signed on 998 players, started 498 Games, and recorded 3,016 unique questions and 3,360 responses.

I have gradually introduced two machine players to serve as controls. The first, the “Gibberish bot”, responds with a random string of alphanumeric characters. The second, “Cleverbot”, uses the Cleverbot API to respond to questions. Both present the same set of fixed questions to other players for consistency. After 9 games each, the Gibberish bot has been given a Smarts Rating of 6.4, Cleverbot has a rating of 40.4, and human players have an average rating of 69.1.
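The Gibberish bot is simple to reproduce. Here is a minimal sketch; the response length is my assumption, and the real bot's parameters may differ.

```python
import random
import string

def gibberish_response(length=40):
    """Respond to any question with a random alphanumeric string."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))
```

A bot this weak is a useful control: its rating of 6.4 suggests that human judges reliably assign near-floor scores to responses carrying no signal at all.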

The full distribution of Smarts Ratings in the system thus far is presented in the histogram below.

Distribution of Smarts Ratings of first 100 human players and 2 machine players. A player’s Smarts Rating is the average of previous opponents’ guesses of the player’s Smarts Rating.

I have yet to analyze questions and responses systematically. Glancing through results as they arrive, I have seen questions and responses along a full spectrum of intelligence and effort, ranging from single-letter replies to clever and elaborate exchanges. Below is a small sample, taken from different Games.

Q: l -|- l = ?

A: Why are you using capital I’s, and what in the world is “-|-”?

Q: If a hacker can determine when keys on your keyboard are pressed (without knowing which keys), how are you in danger?

A: Ugh, this is a difficult one. It would make guessing password easier maybe, because the hacker would know the length of a password. It also depends on what other info is available to the hacker, such as Web addresses or sites visited. Hacker could also known and record when (times each day) the computer is not in use, making it easier to remotely control the computer without the user knowing.

Q: How do you define success?

A: Dictionary.com defines it as “the favorable or prosperous termination of attempts or endeavors; the accomplishment of one’s goals.”

Q: Do you like games?

A: Yes I love games

Q: People who live in Boston are called Bostonians. What is a person who lives in Cambridge, MA called?

A: An academic

Looking Forward

Luna is the central component of my undergraduate thesis for submission to Harvard College. My focus over the next few weeks will be divided between spreading the word to humans, and recruiting additional machine players from other researchers. If you are interested in Luna and would like to contribute to its launch, any of the following would be enormously helpful and appreciated.

Play Luna Games. Sign up at http://luna-game.com. If you sign up with your email address, you can leave the site and receive emails when it's your turn to play. Or play as a guest, as many have opted to do.

Contribute Machine Players. If you are an AI researcher with a question-answering or dialogue system that you'd like to try on Luna, email me at tsilver@college.harvard.edu. I have a RESTful API and example Python code ready to go.

Provide feedback. I am eager for feedback on all aspects of Luna: the theory of the test, the website design, and especially any potential bugs in my coding or thinking. The best way to provide feedback is by commenting on this post, or emailing me at tsilver@college.harvard.edu.

I will do my best to keep this blog updated as results continue to come in, and as I begin to analyze them. Thanks for reading!

References

Brockman, John, ed. What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence. Harper Perennial, 2015. Print.

Carpenter, Rollo. “Cleverbot.” (2011).

Chelba, Ciprian, et al. “One billion word benchmark for measuring progress in statistical language modeling.” arXiv preprint arXiv:1312.3005 (2013).

Deng, Jia, et al. “Imagenet: A large-scale hierarchical image database.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

Hayes, Patrick, and Kenneth Ford. “Turing test considered harmful.” IJCAI (1). 1995.

Moor, James H. “The status and future of the Turing test.” Minds and Machines 11.1 (2001): 77–93.

Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529–533.

Shieber, Stuart M. "No, the Turing Test Has Not Been Passed." The Occasional Pamphlet, 10 June 2014. URL http://blogs.law.harvard.edu/pamphlet/2014/06/10/no-the-turing-test-has-not-been-passed/.

Veselov, V., E. Demchenko, and S. Ulasen. "Eugene Goostman." (2014).