Chatbots need help to seem human David GABIS/Alamy Stock Photo

“I can’t define obscenity, but I know it when I see it.” US Justice Potter Stewart’s famous turn of phrase could also be an apt description of the Turing test, our judgment of whether an AI seems convincingly human. Little clues immediately make it obvious that Siri or Alexa are driven by nonhuman intelligences, but you might be hard-pressed to put your finger on exactly what gave them away.

But a new AI can make that call. Given a snippet of dialogue between a chatbot and a human, the system predicts how convincingly human you or I would rate the chatbot’s response. This could be useful for building better virtual assistants.

Today’s chatbots are great for specific tasks – they can order you a pizza or check the weather – but try asking one if it’s enjoying the weather and the illusion quickly falls apart.


Computer scientists are divided about how the Turing test should work, but most agree that if a chatbot can fool a majority of human judges into thinking they are talking to another human, then it passes the test. That’s not a problem for big companies like Amazon, which uses large teams of human testers to help evaluate Alexa, its voice-operated personal assistant. But for firms with fewer resources it’s expensive and time-consuming to use humans as judges, says Ryan Lowe at McGill University in Montreal, Canada.

Turing test

Would it be possible to cut humans out of this process altogether and just automate the Turing test instead? To find out, Lowe designed an AI system that automatically rates how human-like a piece of chatbot-generated dialogue sounds.

He chose 1000 short Twitter conversations and asked human volunteers to add a response to each. Several chatbots of varying ability then added their own responses, and a second group of volunteers rated all of the responses according to how human-like they sounded.

Lowe then trained his neural network on these human ratings, teaching it to differentiate between convincing and unconvincing responses.
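The core idea is a supervised regression problem: learn to map a (context, response) pair to the rating a human would give it. The sketch below is a deliberately minimal stand-in for Lowe’s neural network – it uses an invented toy dataset, a bag-of-words featuriser and ridge regression, none of which come from the original work – but it shows the same training setup.

```python
import numpy as np

# Toy corpus of (context, candidate response, human rating on a 1-5 scale).
# All examples and ratings here are invented for illustration.
data = [
    ("how is the weather today", "it is sunny and warm", 5.0),
    ("how is the weather today", "i like trains", 1.0),
    ("can you order me a pizza", "sure what toppings would you like", 5.0),
    ("can you order me a pizza", "the weather is sunny", 1.5),
    ("are you enjoying the weather", "yes it is a lovely day", 4.5),
    ("are you enjoying the weather", "pizza pizza pizza", 1.0),
]

vocab = sorted({w for ctx, resp, _ in data for w in (ctx + " " + resp).split()})
idx = {w: i for i, w in enumerate(vocab)}

def featurise(context, response):
    """Bag-of-words over context and response, plus a word-overlap feature."""
    x = np.zeros(len(vocab) + 1)
    for w in (context + " " + response).split():
        x[idx[w]] += 1.0
    x[-1] = len(set(context.split()) & set(response.split()))
    return x

X = np.stack([featurise(c, r) for c, r, _ in data])
y = np.array([s for _, _, s in data])

# Ridge regression: w = (X^T X + lam * I)^-1 X^T y
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def score(context, response):
    """Predicted human-likeness rating for a candidate response."""
    return float(featurise(context, response) @ w)
```

Swapping the featuriser and regressor for a neural network trained on thousands of rated exchanges gives the flavour of the real system.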

After training, Lowe’s algorithm matched the judgment of the human evaluators, but needed only a fraction of a second to reach each decision, vastly speeding up the process of reaching such a consensus.
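Agreement with human judges in this kind of work is typically measured as a correlation between the model’s scores and the human ratings. A small self-contained check of that metric, on made-up numbers:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example: human ratings vs. the automatic evaluator's scores
# for five chatbot responses. A value near 1.0 means the evaluator
# tracks human judgment closely.
human = [4.5, 1.0, 3.5, 5.0, 2.0]
model = [4.2, 1.5, 3.0, 4.8, 2.5]
r = pearson(human, model)
```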

“This will mean you can create better chatbots,” says Oliver Lemon at Heriot-Watt University in Edinburgh, UK. You could build this into chatbots to train them to maximise their evaluation scores so they gradually learn to respond in a human-like way. This is very similar to the way that some of the best image recognition algorithms are trained.
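One lightweight way to “build this in”, short of the full reinforcement-style training Lemon describes, is to have the chatbot generate several candidate replies and let the evaluator pick the best one. The sketch below uses a crude word-overlap stand-in for the trained evaluator; the reranking logic is the point.

```python
def evaluator(context, response):
    """Stand-in scorer: rewards on-topic word overlap. A real system
    would plug in a trained model like Lowe's here."""
    return len(set(context.split()) & set(response.split()))

def best_response(context, candidates, score_fn):
    """Rerank candidate replies by evaluator score and return the best."""
    return max(candidates, key=lambda r: score_fn(context, r))

reply = best_response(
    "are you enjoying the weather",
    ["i like trains", "yes the weather is lovely", "pizza time"],
    evaluator,
)
```

The same score could instead serve as a reward signal during training, so the chatbot gradually learns to produce high-scoring responses directly rather than relying on reranking at run time.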

The system still needs some tweaking. Sometimes being indistinguishable from a human will make a system less useful, Lowe says. For example, answering “I don’t know” to a question is very human, but it’s a frustrating response to hear from a chatbot.

Lowe plans to open-source his chatbot evaluator, so that other researchers can use it to improve their own chatbots.