When Jacob Aron helped judge an artificial intelligence contest, the entrants did not interview well. Better to judge face recognition or even poker skills

Trees or broccoli? (Image: Reuters/Gleb Garanich)

LAST Saturday I took part in a battle of wits at Bletchley Park, the stately home that housed the UK’s codebreakers during the second world war. I was a judge in the annual Loebner prize competition, held to determine whether computers can think just like a human. It probably won’t surprise you that they can’t, but machines are increasingly giving us a run for our money at certain tasks.

Bletchley is a fitting arena: the competition is based on a test proposed by mathematician and computing pioneer Alan Turing, who spent the war there cracking Nazi codes. He argued that if a computer could fool a person into believing it was human, it could think.


With the Loebner prize, four human judges each sit at a computer and carry out two text-based conversations at the same time – one with a real person hidden in a separate room, the other with a chatbot. The judges have 25 minutes to figure out which is which, before moving on to another human/AI pair.

In practice I needed just minutes to tell human from machine. One bot began with the novel strategy of bribing me to split the prize money if I declared it human, while another claimed to be an alien on a spaceship. These tactics didn’t work. The humans quickly made themselves known by answering simple questions about the weather or surroundings, which the bots either ignored or got hopelessly wrong.

In the end none of the four bots fooled any of the judges, and, as with every other contest in the history of the Loebner prize, the best performer only earned a bronze medal. So are we any nearer to a true artificial intelligence?

One problem with the Turing test is that no one can quite agree what counts as a pass. Turing, writing in the 1950s, predicted that by the 21st century it would be possible for computers to pass the test around 30 per cent of the time. Some have interpreted this as the percentage of judges a machine has to fool, leading to headlines last year claiming that a chatbot at the Royal Society in London had passed the test. Others see 50 per cent as a pass.

But even if one of the chatbots had managed to fool us all last week, that wouldn’t really have told us anything about its intelligence. That’s because the results of the test also depend on the judges’ level of technical understanding and choice of questions, which will colour their ratings.

As a result, most AI researchers long ago abandoned the Turing test in favour of more reliable ways to put machines through their paces. In just the past couple of years, algorithms have started to match and even exceed human performance at tasks outside the realm of everyday conversation.

“I spend my time trying to get computers to understand the visual world rather than win a Turing test, because I think it’s a quicker route to intelligence,” says Erik Learned-Miller of the University of Massachusetts, Amherst. He is one of the people behind the Labeled Faces in the Wild (LFW) data set. A collection of more than 13,000 facial images and names, taken from the web, it has become the de facto standard for testing facial recognition algorithms.

There have been vast improvements in this field thanks to hardware and software advances in deep learning and neural networks, AI techniques that attempt to mimic the neuron structure of the brain. Last year Facebook published details of its DeepFace algorithm, which scored 97.25 per cent accuracy on the LFW data set, just short of the human average of 97.5 per cent.
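Benchmarks like LFW score face verification as a series of same-person/different-person pair judgments, with accuracy being the fraction of pairs judged correctly. The sketch below illustrates that style of scoring only; the function name, threshold, and data are invented for the example and are not the LFW or DeepFace code.

```python
# Illustrative scoring of a face-verification benchmark in the style of
# Labeled Faces in the Wild: each trial is a pair of images plus a
# same-person label, and accuracy is the fraction of pairs the model
# judges correctly. All names and numbers here are made up.

def verification_accuracy(pair_scores, same_person, threshold=0.5):
    """pair_scores: similarity scores (0..1) the model assigned to each
    image pair; same_person: matching list of True/False ground truths."""
    correct = 0
    for score, label in zip(pair_scores, same_person):
        predicted_same = score >= threshold  # high similarity -> same person
        if predicted_same == label:
            correct += 1
    return correct / len(pair_scores)

# Toy example: six pairs, model similarity scores and ground truth.
scores = [0.91, 0.12, 0.78, 0.40, 0.66, 0.05]
labels = [True, False, True, True, False, False]
print(verification_accuracy(scores, labels))  # 4 of 6 pairs correct
```

On real benchmarks the threshold itself is usually tuned on held-out data rather than fixed at 0.5, which is part of why small accuracy differences near the human level are hard to interpret.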

“When they got that, people realised this is the way to go,” says Learned-Miller. According to him, it kicked off an arms race between tech’s biggest names. This year Google’s FaceNet system hit 99.63 per cent – seemingly better than humans. That’s not quite the case, says Learned-Miller, as it’s hard to measure our performance accurately, but “machines are about comparable to humans”.

Big companies are also testing their algorithms on a data set called ImageNet, a more general collection of labelled images, and vying to win the Large Scale Visual Recognition Challenge. In advance of this year’s contest, to be held in November, Microsoft has published details of its algorithm that scores a record-beating 95.6 per cent – again, just slightly ahead of humans on this task.
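Scores in the ImageNet challenge are conventionally reported on a "top-5" basis: the algorithm offers its five best guesses per image and is counted correct if any of them matches the true label. A minimal sketch of that scoring rule, with invented labels:

```python
# Illustrative top-5 scoring as used in ImageNet-style challenges:
# a prediction counts as correct if the true label appears anywhere
# in the model's five highest-ranked guesses. Data is invented.

def top5_accuracy(ranked_guesses, truths):
    """ranked_guesses: one list of guesses (best first) per image;
    truths: the true label for each image."""
    hits = sum(1 for guesses, truth in zip(ranked_guesses, truths)
               if truth in guesses[:5])
    return hits / len(truths)

preds = [
    ["broccoli", "tree", "bush", "fern", "moss"],  # truth: tree -> hit
    ["dog", "wolf", "fox", "coyote", "dingo"],     # truth: cat  -> miss
    ["car", "truck", "bus", "van", "tram"],        # truth: bus  -> hit
]
print(top5_accuracy(preds, ["tree", "cat", "bus"]))  # 2 of 3 correct
```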

But one of the challenge’s organisers, Olga Russakovsky of Carnegie Mellon University (CMU) in Pittsburgh, Pennsylvania, points out that the algorithms only have to identify images as belonging to one of a thousand categories. That’s tiny compared with what humans can achieve. “Even if you can recognise all objects, that’s very far from building an intelligent machine,” she says. To show true intelligence, machines will have to draw inferences about the wider context of an image, and about what might happen a second after the picture was taken.

In an altogether different arena, bots are already displaying abilities of this kind. When humans have to make decisions based on partial information, we try to infer what other people will do. Could an AI do the same? “Poker has become the benchmark for measuring intelligence in these incomplete information settings,” says Tuomas Sandholm, also at CMU.

The uncertainties of poker make it a much harder game for machines than chess, at which computers are now unbeatable. In January a team at the University of Alberta in Edmonton, Canada, published details of a poker bot that can beat any human – but only at a simpler, two-player form of the game with fixed bet sizes.

In proper poker, humans still hold the edge, but only just. A few months ago, Sandholm pitted his bot against a team of poker pros. It lost by a slim margin. “At least 99.9 per cent of humans would be much worse than our program,” he says. This kind of tournament is an improvement over the Turing test, he says. “I like it a lot as a test, because it’s not about trying to fake AI. You really have to be intelligent to beat humans.”

Is there any life left in the Turing test? Bertie Müller of the Society for the Study of Artificial Intelligence and Simulation of Behaviour, which administers the Loebner prize, says the contest is held partly for tradition’s sake. Turing himself might not view it as the best test of intelligence were he alive today, he says. For Müller, a better test might be to observe an AI in a variety of environments – a bit like putting a toddler in a room full of toys and studying what it does.

“There has been a shift to trying to replicate these more fundamental abilities on which intelligence is built,” says Learned-Miller. All the researchers I spoke to agreed that a truly intelligent machine would have to be able to get a sense of the real world through computer vision, and not just be confined to a text-based interface. But we are a long, long way from putting all the pieces together to get a thinking machine.

Whose AI is top of the class?

How best to put artificial intelligence to the test? Assessments straight out of the classroom are catching on.

In a study published last week, an AI system called ConceptNet tackled an IQ test designed for preschoolers, fielding questions like “Why do we wear sunscreen in summer?” Its results were on a par with those of the average 4-year-old (arxiv.org/abs/1509.03390). And last year, a system called To-Robo passed the English section of Japan’s national college entrance exam.

At the Allen Institute for Artificial Intelligence in Seattle, Washington, Peter Clark and colleagues are honing a program called Aristo by giving it New York state school science exams. The exams are a great exercise, says Clark, because they require machines to do basic reasoning and to handle questions they might not have seen before.

Not everyone is convinced. Ernest Davis, a computer scientist at New York University in New York City, points out that AI often struggles with what we would regard as common sense, so ordinary exams might not be the best way to measure machines’ progress. Instead, he suggests writing exams specifically for machines. The questions would be trivial for a human but too strange or obvious to be looked up online, such as: “Is it possible to fold a watermelon?”

Aviva Rutkin