At a small event in San Francisco last night, IBM hosted two debate club-style discussions between two humans and an AI called “Project Debater.” The goal was for the AI to engage in a series of reasoned arguments according to some pretty standard rules of debate: no awareness of the debate topic ahead of time, no pre-canned responses. Each side gave a four-minute introductory speech, a four-minute rebuttal to the other’s arguments, and a two-minute closing statement.

Project Debater held its own.

It looks like a huge leap beyond that other splashy demonstration we all remember from IBM, when Watson mopped the floor with its competition at Jeopardy. This AI demonstration was built on that foundation. It had many corpora of data it could draw from, just like Watson did back in the day. And like Watson, it was able to analyze the contents of all that data to come up with the relevant answer. But this time, the “answer” was a set of cogent points on subsidizing space exploration and on telemedicine, laid out in four-minute speeches defending each.

Project Debater cited sources, pandered to the audience’s affinity for children and veterans, and did a passable job of cracking a relevant joke or two in the process.

That’s pretty impressive. It essentially created a freshman-level term paper kind of argument in just a couple of minutes when presented with a debating topic it had no specific preparation for. The system has “several hundred million articles” — which it assumes are accurate — in its data banks, spanning about 100 areas of knowledge. When it gets a debate topic, it takes a couple of minutes to spelunk through them, decides what would make the best arguments in favor of the topic, and then creates a little speech describing those points.
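For the curious, that retrieve-score-assemble process can be sketched roughly in code. This is a hypothetical illustration of the pipeline as the demo described it, not IBM's actual system; the function, field names, and word limit are all invented for the example.

```python
# Rough sketch of an argument-construction pipeline: find passages
# relevant to the topic, rank them by how strongly they support it,
# then assemble the best ones into a speech of bounded length.
# (A real system would use learned relevance and stance models,
# not simple keyword matching.)

def build_opening_speech(topic, corpus, word_limit=600):
    # 1. Retrieve candidate passages that mention the debate topic.
    candidates = [doc for doc in corpus
                  if topic.lower() in doc["text"].lower()]

    # 2. Rank candidates by a (here, precomputed) support score.
    ranked = sorted(candidates, key=lambda d: d["support_score"],
                    reverse=True)

    # 3. Assemble the top points into a speech within the word budget.
    speech, words = [], 0
    for doc in ranked:
        n = len(doc["text"].split())
        if words + n > word_limit:
            break
        speech.append(doc["text"])
        words += n
    return " ".join(speech)
```

The actual system obviously does far more — detecting claims, checking evidence, generating transitions — but the basic shape is retrieval plus ranking plus assembly.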

Some of the points it made were pretty facile; some quoted sources, and some were pretty clearly cribbed from articles. Still, it was able to move from the “present information” mode we usually think of when we hear AI to a “make an argument” mode. But what impressed me more was that it attempted to directly argue with points that its human opponents made, in nearly real time. (The system needed a couple of minutes to analyze the human’s four-minute speech before it could respond.)

Was the AI arguing in good faith? I wasn’t entirely sure.

It frankly made me feel a little unsettled, but not because of the usual worries like “robots are going to become self-aware and take over” or “AI is coming for our jobs.” It was something subtler and harder to put my finger on. For maybe the first time, I felt like an AI was trying to dissemble. I didn’t see it lie, nor do I think it tried to trick us, but it did engage in a debating tactic that, if you saw a human try it, would make you trust that human a little bit less.

Here was the scene: a human debater was arguing against the notion that the government should subsidize space exploration. She set up a framework for understanding the world, which is a pretty common debating tactic. Subsidies, she argued, should fit one of two specific criteria: fulfilling basic human needs or creating things that could only be done by the government. Space exploration didn’t fit the bill. Fair enough.

Project Debater, whose job was to respond directly to those points, didn’t quite rebut them directly. It certainly talked in the same zone. It claimed that “subsidizing space exploration usually returns the investment” in the form of economic boosts from scientific discovery, and it said that for a nation like the United States, “having a space exploration program is a critical part of being a great power.”

What Project Debater didn’t do was directly engage the criteria set forth by its human opponent. And here’s the thing: if I were in that debate, I wouldn’t have done so either. It’s a strong debating tactic to set the framework of debate, and accepting that framework is often a recipe for losing.

So the question is: did Project Debater simply not understand the criteria, or did it understand and choose not to engage on those terms? Watching the debate, I figured the answer was that it didn’t quite get it, but I wasn’t positive. I couldn’t tell the difference between an AI not being as smart as it could be and an AI being way smarter than I’ve seen an AI be before. It was a pretty cognitively dissonant moment. Like I said, unsettling.

“If it really believes it understands what that opponent was saying, it’s going to try to make a very strong argument against that point specifically.”

Jeff Welser, the VP and lab director for IBM research at Almaden, put my mind at ease. Project Debater didn’t get it. But it didn’t get it in a really interesting and important way. “There’s been no effort to actually have it play tricky or dissembling games,” he tells me (phew). “But it does actually do … exactly what a human does, but it does it within its limitations.”

Essentially, Project Debater assigns a confidence score to every piece of information it understands. How confident is the system that it actually understands the content of what’s being discussed? “If it’s confident that it got that point right, if it really believes it understands what that opponent was saying, it’s going to try to make a very strong argument against that point specifically,” Welser explains.

“If it’s less confident,” he says, “it’ll do its best to make an argument that’ll be convincing as an argument even if it doesn’t exactly answer that point — which is exactly what a human does, too, sometimes.”

So, the human says that government should have specific criteria surrounding basic human needs to justify subsidization. Project Debater responds that space is awesome and good for the economy. A human might choose that tactic as a sneaky way to avoid debating on the wrong terms. Project Debater had different motivations in its algorithms, but not that different.
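Welser's description of that confidence-gated strategy is concrete enough to sketch in code. To be clear, this is a toy illustration of the decision rule as he explained it — the threshold, data structures, and names are all my own assumptions, not anything IBM has published.

```python
# Hypothetical sketch of a confidence-gated rebuttal strategy:
# rebut the opponent's point directly only when the system is
# confident it understood that point; otherwise fall back to the
# strongest general argument it has, even if it doesn't exactly
# answer the point.

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff, not IBM's actual value

def choose_rebuttal(opponent_point, confidence,
                    direct_rebuttals, general_arguments):
    """Pick a rebuttal based on how well the system believes it
    understood the opponent's point."""
    if confidence >= CONFIDENCE_THRESHOLD and opponent_point in direct_rebuttals:
        # High confidence: argue against that point specifically.
        return direct_rebuttals[opponent_point]
    # Low confidence: pivot to the most convincing general argument.
    return max(general_arguments, key=lambda a: a["strength"])["text"]
```

Seen this way, the "sneaky pivot" isn't a trick at all: it's just what the fallback branch looks like from the outside.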

The point of this experiment wasn’t to make me think that I couldn’t trust that a computer is arguing in good faith — though it very much did that. The point is that IBM was showing off that it can train AI in new areas of research that could eventually be useful in real, practical contexts.

The first is parsing a lot of information in a decision-making context. The same technology that can read a corpus of data and come up with a bunch of pros and cons for a debate could be (and has been) used to decide whether or not a stock might be worth investing in. IBM’s system didn’t make the value judgment, but it did provide a bank with a bunch of information laying out both sides of a debate about the stock.

“This is still a research-level project.”

As for the debating part, Welser says that it “helps us understand how language is used,” by teaching a system to work in a rhetorical context that’s more nuanced than the usual “Hey Google, give me this piece of information and turn off my lights.” Perhaps it might someday help a lawyer structure their arguments, “not that Project Debater would make a very good lawyer,” Welser joked. Another IBM researcher suggested that this technology could someday help identify fake news.

How close is this to being something IBM turns into a product? “This is still a research-level project,” Welser says, though “the technologies underneath it right now” are already beginning to be used in IBM projects.

In the second debate, about telemedicine, Project Debater once again had a difficult time parsing the nuance of the point its human opponent was making about how important the human touch is in diagnosis. Rather than discuss that, it fell back to a broader argument, suggesting that maybe the human was just afraid of new innovations.

“I am a true believer in the power of technology,” quipped the AI, “as I should be.”