The conversational AI created by IBM, called Project Debater, is designed to hold a formal debate with a person. While Project Debater did lose its final debate to a human champion debater, it had its bright spots and solidly entered the “uncanny valley” for natural language processing (NLP) systems – close enough to human that its mistakes were cringe-worthy.

What is remarkable, given all of the deep learning hardware accelerators available in the market, is that IBM Research trained and delivered Project Debater on a collection of older hardware, without hardware accelerators like GPUs. This speaks to the upside for conversational AI capabilities in the next few years. It also may fuel social discontent as it becomes increasingly difficult for people to discern that they are conversing with AIs.

The initial proposal for Project Debater was a single PowerPoint slide presented in 2011. Due to the volume of submissions and the careful consideration IBM gives its grand-challenge-style projects, approval took about one year, and in 2012 IBM Research established a project team. “Intense” work on Project Debater began in 2014.

How can an ensemble of deep learning models be trained to debate? Debating is much more complex than simply accessing facts. IBM Research breaks the task of training an ensemble of AI models to debate into three distinct capabilities: listening comprehension, modeling human dilemmas, and data-driven speech writing and delivery.

Listening Comprehension

IBM defines listening comprehension as “the ability to identify the key concepts and claims hidden within long continuous spoken language.” IBM taught Project Debater to debate randomly selected topics, but only if the selected topics are well covered in the system’s massive 400 million article (10 billion sentence) corpus, most of which consists of articles from well-known newspapers and magazines.
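
The idea of gating debate topics on corpus coverage can be sketched in a few lines. This is purely an illustrative assumption about the approach – the function names, matching logic, and threshold below are ours, not IBM’s:

```python
# Hypothetical sketch of a topic-coverage gate: debate a topic only if the
# corpus holds enough relevant material. All names and thresholds here are
# illustrative assumptions, not IBM's actual design.

def topic_coverage(topic_terms, corpus_sentences):
    """Count corpus sentences that mention any of the topic's key terms."""
    terms = {t.lower() for t in topic_terms}
    return sum(
        1 for s in corpus_sentences
        if terms & set(s.lower().split())
    )

def can_debate(topic_terms, corpus_sentences, min_hits=2):
    """Accept the topic only if coverage clears a minimum threshold."""
    return topic_coverage(topic_terms, corpus_sentences) >= min_hits

corpus = [
    "Preschool subsidies improve early childhood outcomes.",
    "Critics argue subsidies distort education markets.",
    "The weather was pleasant in Haifa today.",
]
print(can_debate(["subsidies", "preschool"], corpus))  # True
print(can_debate(["cryptocurrency"], corpus))          # False
```

A production system would of course use an indexed search over 10 billion sentences rather than a linear scan, but the gating principle is the same.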

A critical part of Project Debater’s listening comprehension capability is its claim detection engine, which finds the exact boundaries of individual claims within a sentence (if the sentence makes any claims). The engine cascades three deep learning models:

Find a sentence that contains a claim.

Find the exact boundaries of the claim within that sentence.

Determine a confidence score that a claim exists and has been properly bounded.
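
The three-stage cascade above can be sketched as follows. The stand-in heuristics here (cue words, a “that”-clause boundary, a length-based score) are our illustrative assumptions; IBM’s engine uses trained deep learning models at each stage:

```python
# A minimal sketch of a three-stage claim-detection cascade. Each stage is a
# trivial stand-in for a trained model: detect, bound, then score confidence.

CLAIM_CUES = {"should", "must", "believe", "argue"}

def has_claim(sentence):
    """Stage 1: flag sentences likely to contain a claim."""
    return any(w in sentence.lower().split() for w in CLAIM_CUES)

def claim_boundaries(sentence):
    """Stage 2: guess the claim span; here, everything after 'that'."""
    words = sentence.split()
    if "that" in words:
        return " ".join(words[words.index("that") + 1:])
    return sentence

def claim_confidence(sentence, claim):
    """Stage 3: score confidence that the bounded span is a real claim."""
    return min(1.0, len(claim.split()) / max(len(sentence.split()), 1))

def detect_claim(sentence, threshold=0.3):
    """Run the cascade; return the claim span or None."""
    if not has_claim(sentence):
        return None
    claim = claim_boundaries(sentence)
    return claim if claim_confidence(sentence, claim) >= threshold else None

print(detect_claim("We believe that governments should subsidize preschool."))
```

The cascade structure matters: cheap early stages filter most sentences so the expensive boundary and confidence models run on far fewer candidates.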

Modeling Human Dilemmas

IBM created what it calls a “unique knowledge graph representation” to help Project Debater model the world of human controversy and dilemmas. Once Project Debater is given a topic, it searches its knowledge graph for the most relevant principled arguments to support or oppose the topic. The knowledge graph enables Project Debater to model commonalities across many different debates.
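
A toy version of retrieving recurring “principled” arguments from such a graph might look like the following. The graph contents and substring matching are illustrative assumptions about the approach IBM describes, not its actual representation:

```python
# A toy knowledge graph of recurring debate principles, keyed by concept,
# with pro and con argument templates attached. Contents are illustrative.

KNOWLEDGE_GRAPH = {
    "subsidies": {
        "pro": ["Subsidies expand access for low-income families."],
        "con": ["Subsidies can distort markets and raise costs."],
    },
    "bans": {
        "pro": ["Bans protect the public from demonstrable harms."],
        "con": ["Bans restrict individual freedom of choice."],
    },
}

def retrieve_arguments(topic, side):
    """Return principled arguments linked to any concept the topic mentions."""
    hits = []
    for concept, stances in KNOWLEDGE_GRAPH.items():
        if concept in topic.lower():
            hits.extend(stances[side])
    return hits

print(retrieve_arguments("We should fund preschool subsidies", "pro"))
```

Because many debates share underlying principles (freedom vs. safety, access vs. cost), a single graph node can serve arguments across many different resolutions – the commonality the article mentions.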

Data-Driven Speech Writing and Delivery

Project Debater assembles the selected arguments into a complete persuasive narrative that fits within its allotted speaking time. It then writes a speech and delivers it with clarity, purpose, and, where appropriate, humor.

IBM emphasized the speech writing and delivery portions of NLP as Project Debater’s distinguishing feature. However, we believe that understanding human speech will be the key technology that either restrains AI in the uncanny valley or moves it beyond.

Note that OpenAI’s text prediction model performs a similar role to IBM’s Project Debater AI in one respect: it writes a story. But that is all OpenAI’s model is intended to accomplish, and it does so without the time limits of a formal debate.

Complex Ensemble Of Models

IBM Research says that Project Debater is assembled from dozens of deep learning and machine learning models, depending on how the models are counted. IBM said the total number of models used during the debate was well under 100.

Project Debater’s development ran into a classic machine learning challenge: ensemble complexity increased due to orchestrating so many underlying AI models. Each model was trained on its own, and many of the models used different configurations and setup methodologies. Some used supervised learning techniques, others unsupervised. IBM Research did not discuss the cumulative time it took to train all of the models in the ensemble.
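
The orchestration challenge – chaining independently trained models, each with its own configuration, into one pipeline – can be sketched as follows. The stage names and stand-in model callables are our assumptions for illustration:

```python
# A minimal sketch of orchestrating an ensemble of independently trained
# models. Each stage wraps one model plus its own configuration; the
# orchestrator threads a shared context through the stages in order.

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class ModelStage:
    name: str
    run: Callable[[Dict[str, Any]], Any]   # stand-in for a trained model
    config: Dict[str, Any] = field(default_factory=dict)

def run_pipeline(stages: List[ModelStage], context: Dict[str, Any]):
    """Run stages in order; each writes its output back into shared context."""
    for stage in stages:
        context[stage.name] = stage.run(context)
    return context

stages = [
    ModelStage("claims", lambda ctx: ["claim-A", "claim-B"]),
    ModelStage("evidence",
               lambda ctx: {c: f"evidence for {c}" for c in ctx["claims"]}),
    ModelStage("speech", lambda ctx: " ".join(ctx["evidence"].values())),
]
result = run_pipeline(stages, {"topic": "subsidies"})
print(result["speech"])  # "evidence for claim-A evidence for claim-B"
```

The pain point the article describes shows up even in this sketch: every stage depends on the exact output schema of the stages before it, so models trained separately under different methodologies must still agree on interfaces.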

Much of the research and development for Project Debater was performed at IBM Research’s lab in Haifa, Israel. IBM said that Project Debater training used on the order of ten Lenovo System x3650 M5 servers coordinated via IBM’s Platform Load Sharing Facility (LSF) software. IBM Research also used a GPFS cluster for storage on prem in Haifa during training. This local storage ran on a two-node IBM Spectrum Virtualize (SVC) cluster using IBM’s SAN64B-6 storage networking.

IBM is not yet discussing the machine learning and deep learning frameworks they used in training Project Debater’s models. The company did say that once it commercializes the technology for customer use, it will publish more of the hardware and software detail.

Note that training Project Debater’s complex ensemble of models was performed on the equivalent of a quarter rack of unaccelerated x86 servers containing on the order of two hundred processor cores. AI research often doesn’t need the latest and greatest hardware – innovation and time can make up for a big budget.

Example Of Complexity – Debate Rebuttal

Project Debater’s rebuttal arguments are created by different deep learning models than those used to create its opening arguments. There are several types of models in the complete rebuttal ensemble, and the rebuttal system is usually a longer cascade of models than the ensemble used to create the opening arguments.

Project Debater implements claim detection for both sides of a debate. This enables the system to build both sets of arguments automatically and then determine which of the claims an opponent is likely to use. The system then cross-checks the list of likely opponent claims with an opponent’s actual claims during a debate. Here’s part of the process:

Determine whether each claim attacks or supports Project Debater’s given position on the argument.

Determine whether the opponent used any of the automatically detected claims during their most recent arguments.

Find evidence for rebuttal against the claims the opponent has already spoken.

Find evidence that supports Project Debater’s position.

Integrate evidence into correct and persuasive statements.

Determine if the point has already been claimed or delivered in an earlier segment of the debate.
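
The cross-check at the heart of the steps above – comparing predicted opponent claims against what the opponent actually said – can be sketched like this. The word-overlap similarity and sample data are our illustrative assumptions, not IBM’s matching method:

```python
# A sketch of the rebuttal cross-check: match claims the system predicted
# its opponent would use against the opponent's actual transcript, then
# rebut only the matched claims. Matching logic here is a crude stand-in.

def overlap(a: str, b: str) -> float:
    """Jaccard word-overlap similarity between two claim strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def match_claims(predicted, transcript, threshold=0.4):
    """Return predicted claims the opponent appears to have actually used."""
    return [
        p for p in predicted
        if any(overlap(p, line) >= threshold for line in transcript)
    ]

predicted = ["subsidies distort markets", "subsidies raise taxes"]
transcript = ["I argue subsidies distort markets badly"]
used = match_claims(predicted, transcript)
print(used)  # ['subsidies distort markets']
```

Pre-computing both sides’ likely claims lets the system prepare rebuttal evidence in advance, then spend its tight response window on matching rather than generation.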

Debate Deployment Infrastructure

There were two full runtime ensembles of Project Debater running concurrently and separately for resiliency:

The production (live debate) system used a blend of IBM Cloud and on prem infrastructure running in IBM Research’s lab in Haifa, Israel.

The backup instance was entirely deployed in IBM Cloud.

IBM Research’s runtime Project Debater system was composed of the following:

The primary server was a dual-socket IBM/Lenovo System x3650 M5 server. The server housed two 14-core Intel Xeon E5-2600 v4 processors and 768 GB of system memory. One instance was deployed on prem and one in IBM Cloud’s Dallas datacenter.

An Elasticsearch cluster. The cluster contained four bare-metal machines each with 64 GB memory, 12 cores, and two 960 GB SSD disks. Two instances were deployed in IBM Cloud: one in Paris and one in Dallas.

A Cassandra database cluster. The cluster contained four Linux VMs each with 32 GB memory and 4 cores. Two instances of the cluster were deployed: one in IBM Cloud in Dallas and the other on-prem in Haifa.

The servers were networked using IBM Cloud networking infrastructure at 10 Gb/sec. For live debate in San Francisco, IBM used only 10 Mb/sec Internet connectivity to connect its on-stage control laptops to IBM Cloud and the IBM Research lab in Haifa.

IBM Watson’s commercial cloud-based Speech to Text service and Text to Speech service. Text to Speech was run using the female voice on an IBM Cloud Kubernetes cluster.

There were additional services running on an IBM Cloud Kubernetes cluster for handling voting, event flow management, and background screen rendering.

IBM built a kiosk for Project Debater’s on-stage presence. The kiosk contained only two flat screens displaying Project Debater’s graphics avatar.

IBM did not use compute accelerators, such as GPUs or FPGAs, in its production Project Debater systems (aside from any that might be embedded behind commercial IBM Watson cloud services).

IBM noted that its development team invested in optimizing its runtime models to meet a one- to two-minute latency window for responding to its opponent’s opening speech and rebuttal speech.

We are impressed that IBM’s Project Debater runtime was deployed on old, mainstream servers and yet still performed so well. It is likely that the whole system could be collapsed into 6U of rack height with today’s state-of-the-art servers and storage systems.

Realtime Challenges

Project Debater is not a short-sentence, quick-response conversational system. It is not designed to engage in a conversation with little to no context at the start and then build context on the fly. Project Debater is given context in the form of the “resolution” (the debate topic, see “Structure” below) at the start of each debate. A conversation is composed of much shorter clusters of phrases and sentences. For Project Debater to engage in an active and responsive conversation, its response latency would need to be on the order of one to two seconds.

Project Debater only listened to its opponent’s microphone; it received no aural or visual feedback from an audience. In a live debate format, a human debater watches the audience to assess its real-time reaction to arguments. Remember that the audience grades the debaters; it is not the moderator the debaters need to impress. No human audience is completely stoic, so this feedback enables debaters to change course during their speech. In principle, we believe an entirely new ensemble of models might be added to a Project Debater successor to assess an audience’s visual reactions (head nods, boredom, excitement, etc.) and aural reactions (claps, gasps, chuckles, etc.) to arguments – though that is not something IBM has planned, as debating is not a commercial interest for IBM. Such a system would also have to create several narratives in advance, until it becomes capable of crafting alternate arguments on the fly in response to audience cues.

Solving both of these problems – for group speaking and for one-on-one chats with individual humans via smartphone, webcam, and so forth – will enable conversational systems to appear much more human. We have more thoughts about the implications of being more human in Modernizing the Turing Test for 21st Century AI.

Extending Human Knowledge

As part of IBM’s research, the Project Debater team developed 20 benchmark datasets, all of which have been released under either Creative Commons License (CC BY-SA 3.0) or GNU Free Documentation License (GFDL), including:

19,276 pairs of Wikipedia concepts with manual scores for their level of relatedness

5,000 idioms with sentiment annotation

3,000 sentences annotated with mentions

2,394 labeled claims on 55 topics

60 speeches recorded by professional debaters about controversial topics with transcripts, raw and cleaned

IBM Research published 32 papers describing much of its work in designing and training Project Debater, and it has done a lot of good work in bias detection – both detecting bias in datasets and detecting bias in trained models. IBM also publishes short descriptive snippets of text that point back to its research papers.

However, sharing research papers and datasets is not the same as sharing trained models. As we note above, IBM Research disclosed only a few high-level details about its training hardware, but no details about software frameworks or the specific trained models it deployed in the runtime ensemble. IBM published the training datasets, but not the training code for specific models nor any weights associated with its trained models. We have only a rough verbal description of the runtime software architecture to go by. As we mention above, IBM stated that it will publish more details after the technology is in-market.

Given that IBM is commercializing Project Debater technology in its Speech by Crowd product, it’s not too surprising that it isn’t giving many hints about training.

This is effectively the same path that OpenAI took. OpenAI published an extensive blog and a well-documented paper, describing the multi-task learner algorithms behind its 1.5 billion parameter GPT-2 “Transformer” model. But, OpenAI also chose not to publish “the dataset, training code, or GPT-2 model weights,” choosing instead to publish a smaller, less capable trained model.

OpenAI cited the possible misuse of its more capable model in its decision not to release that model. But a month after announcing and not publishing the model, OpenAI announced it would monetize such models via a “capped-profit” spin-out company.

It seems that AI is all fun and games until it does something profitable, which happens surprisingly often these days.

Governing AI

The downside of incredibly rapid progress in AI is a building cultural backlash against the misuse of AI. The recent SXSW Interactive conference included an “Intelligent Future” track with many sessions highlighting the pros and cons of AI bias and ethics.

We attended the European Union (EU) sponsored panel session “Algorithms Go to Law School: The Ethics of AI” on March 11, 2019. One of the topics of discussion was the upcoming European Commission (EC) “Ethics guidelines for trustworthy AI.” The final version of the guidelines will be delivered to the EC on or before April 9.

However, during the question-and-answer session that followed the panel discussion, a reporter asserted that AI was too dangerous a technology to pursue and that “all work in AI should be stopped.” The assertion itself wasn’t shocking, but it was shocking that about a third of the audience applauded it.

SXSW is typically a year or two ahead of the curve for technology-driven social issues. The strong interest in AI ethics and bias we experienced at SXSW is probably a good indicator of things to come at a general social level in the next few years. Our industry will need to become more transparent to build citizen and consumer trust.

Conclusion

IBM held the final demonstration of its Project Debater AI at its Think event on February 11. Project Debater lost its final demonstration debate, but in the process of getting to the debate, IBM Research built a system capable of having a credible, context-sensitive debate with a human, on a human time scale.

IBM designed Project Debater to engage in a meaningful discussion with a human, but with the human opponent’s and any human bystanders’ full understanding that Project Debater is an AI – with the explicit knowledge that it is not human. IBM Research designed the system to add jokes to its speeches and gave the system a voice created with a New York-based actress, while making sure it was an obviously synthesized voice. The end result sounds machine-generated yet expressive and not monotone, which is important for debating.

Project Debater’s opening arguments were much better than we anticipated – they challenged our assumptions about the current capabilities of NLP. However, in its rebuttal and summary speeches Project Debater missed several nuances of its human opponent’s arguments and showed no empathy with its human audience. Its gaps in understanding and lack of empathy consigned Project Debater to the uncanny valley.

Creating an AI that can participate in this style of open topic debate was a grand challenge in every respect.

We also might argue that creating an AI that can credibly debate a human on random topics (win or lose) is the equivalent of passing the infamous Turing Test. What is the Turing Test, really, and what does “passing the Turing Test” mean? We took a stab at answering those questions at Modernizing the Turing Test for 21st Century AI. The short answer is that passing a test for general conversational intelligence may require a far more human context than we can approach with AI ensembles today.

While IBM Project Debater was impressive, it’s going to be difficult to move conversational systems out of the uncanny valley over the next few years. But proving something can be done is half the battle in enabling others to replicate the results. Training and delivering Project Debater on a patchwork of older hardware was an impressive feat of research and development. Could Project Debater have been trained and deployed faster using newer hardware? It’s hard to tell, but we believe others are likely to devote far more resources to conversational systems; we’ll see how fast the field evolves.

It is certain that conversational systems will become more competent and sophisticated. This will affect both jobs and social structures. It’s time to have a serious conversation about the future of NLP.