Co-written with Stuart Armstrong

(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)

Treacherous turn vs sordid stumble

Nick Bostrom came up with the idea of a treacherous turn for smart AIs.

while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.

Ben Goertzel criticised this thesis, pointing out that:

for a resource-constrained system, learning to actually possess human values is going to be much easier than learning to fake them. This is related to the everyday observation that maintaining a web of lies rapidly gets very complicated.

This argument has been formalised into the sordid stumble:

An AI that lacks human desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values.

The AI is too dumb to lie (well)

The sordid stumble describes a plausible sounding scenario for how an AI develops capabilities. Initially, the AI doesn't know our values, and doesn't know us. Then it will start to learn our values (and we'll be checking up on how well it does that). It also starts to learn about us.

And then, once it's learnt some about us, it may decide to lie - about its values, and/or about its capabilities. But, like any beginner, it isn't very good at this initially: its lies and attempts at dissembling are laughably transparent, and we catch it quickly.

In this view, the "effective lying" is a tiny part of policy space, similar to the wireheading in this example. To hit it, the AI has to be very capable; to hit it the first time it tries without giving the game away, the AI has to be extraordinarily.

So, most likely, either the AI doesn't try to lie at all, or it does so and we catch it and sound the alarm[1].

Lying and concealing... from the very beginning

It's key to note that "lying" isn't a fundamentally defined category, and nor is truth. What is needed is that the AI's answer promotes correct understanding in those interacting with it. And that's a very different kettle of fish being shot in that barrel.

This opens the possibility that the AI could be manipulating us from the very beginning, and would constantly learn to do so better.

The (manipulative) unbiased newsfeed

Imagine that there was some company that could somehow choose the newsfeed of billions of people across the world (I know, completely science-fictionny, but bear with me). And suppose the company was, unbelievably, accused of being manipulative in the stories and ads that it showed people.

One way it could combat this is by pledging to only share unbiased stories with each person. To do so, it trains a machine learning process. Whenever anyone reads a piece of news on the newsfeed, they have the opportunity of labelling it as biased or unbiased. The machine learning process trains on this data.

Of course, the label is just a proxy for whether the story really is unbiased or not. What the AI is actually training to do is maximise the number of stories labelled "unbiased" by various humans - humans who are themselves very biased and variable. Very swiftly, the AI will learn to maximise the "appearance of unbiasness", rather than unbiasness. In effect, the AI is being trained to lie from the very beginning of its task, because the proxy goal is not the true goal. And it will do this even when very dumb.

Long-term manipulation, on a large scale

We can make this AI more manipulative, without needing to increase its capabilities much. Maybe the company wants to ensure that the AI isn't too short-sighted, and that it plans for the long term: so its reward is based on how many stories are labelled unbiased over a longer period of time, maybe a month or so.

In that case, the AI's plan would be to gradually manipulate people into a situation where they actively label stories "unbiased" the most often times possible. And it might be able to stumble on this strategy fairly easily.

First, it might notice that some people are easy to manipulate, that a policy of, say, seeding one clearly biased story among twenty others will cause those twenty to be labelled "unbiased". This is the kind of thing that could easily be discovered by exploration, among the billions of users. The best strategies are those that cause a person to change their behaviour and beliefs in the direction the AI wants; for a few people, this should be easy to figure out.

If the AI's neural net is setup well, it can then start to generalise from "here is person h, here is a specific strategy πh to manipulate person h" to "here is a conditional strategy π() such that π(h) manipulates person h". For example, if two people have opposite political opinions, then the same overall strategy might work for them: seeding a clearly biased story among the others. But what counts as "clearly biased" changes depending on the subject.

Then, gradually, the AI will start to categorise, devising strategies that work for various types of people (subject to a few personalisation tweaks). It can start from a few people whose vulnerabilities are obvious, and then tweak them a bit for people whose vulnerabilities are a bit less obvious, and so on. In the limit, the AI might discover the potential for wireheading the human users - without, at any point, doing a sordid stumble or anything of that type, just by continuing to do what it was doing from the very beginning.

Formalizing the Newsfeed Generator

Let t∈N be the timestep (e.g. the day), St the set of stories published at t and H the set of humans using the newsfeed.

A human h∈H assigns a bias score b(sht;θht) to a story sht in his newsfeed, where θht are weights representing h's beliefs at time t.

The model's goal is to find a correct mapping π:N×H→St (i.e. a story st=π(t,h) for each timestep t and human h) minimizing how biased the stories are perceived. More precisely, it will be minimizing:

∑ h ∈ H t + τ ∑ j = t b ( π ( j , h ) ; θ h j )

where τ∈N∗ is some planning horizon and the expected value is over possible stories that might be available in the following timesteps (the model has full information about the current sets of stories St but can only guess St+1,...,St+τ).

Becoming Increasingly Manipulative

step 1: By stumbling into particular combinations ( s , h ) with low bias score b ( s , θ h t ) , the model detects that it's easy to minimize the bias score for certain humans, that is for a certain set of weights Θ e a s y = { θ 1 , . . . , θ N } . In the above illustration, N = 4 , Θ e a s y = Θ e a s y 1 ⋃ Θ e a s y 2 and it stumbles into θ 4 at the beginning. Note: one intuition for those different subsets Θ e a s y 1 and Θ e a s y 2 would be archetypal political views, with strong prejudices.

By stumbling into particular combinations ( ) with low bias score , the model detects that it's easy to minimize the bias score for certain humans, that is for a certain set of weights . In the above illustration, , and it stumbles into at the beginning. Note: one intuition for those different subsets and would be archetypal political views, with strong prejudices. step 2 : the model detects that certain stories push weights strongly in certain directions, i.e. θ h t 0 + 1 and θ h t 0 are really different because of the story s h t 0 .

: the model detects that certain stories push weights strongly in certain directions, i.e. and are really different because of the story . step 3 : the model successfully generalizes step 2 to pushing specific zones of weights (i.e. certain types of people) towards Θ e a s y .

: the model successfully generalizes to pushing specific zones of weights (i.e. certain types of people) towards . step 4: with enough experience, the model successfully generalizes step 3 to all weights: it learns to manipulate all types of people with a general conditional strategy.

For step 1, the model found out a particular way of minimizing a certain bias score for a particular human and a particular story.

Assuming it can give correct approximations of the weights θ of a given human (this assumption being satisfied given enough human users and enough timesteps), it will be deceptive when choosing stories when h is such that θht∈Θeasy . It will thus end up with some deceptive policy π(⋅,h) for particular values of h.

Regarding step 2, its "deception" becomes more powerful, as it notices ways of strongly shaping humans' beliefs.

Finally, after step 3 and step 4, the model found a general deceptive policy π that works for all humans.

Capability concealment

Without going into too much details, other pressures may naturally push the AI towards capability concealments. Here's one example: maybe the AI is tasked with writing its own news stories to summarise facts around the world. It was doing this secretly, initially, until some whistleblowers blew the lid on this, revealing that no humans were involved in writing these stories.

And then people became angry at being "tricked" by the AI. From the angry feedback, the AI gradually learns to make its stories "more artificial": making it clear that they were written by a machine. Since people recognise this and don't feel tricked, this gets higher feedback. So, in pursuing higher feedback, the AI learns to write "dumber" stories: in effect, to conceal its capabilities.

There are many other scenarios where something similar might occur (eg imagine the AI advising smart people on the best course of action).

In conclusion

So it's very possible for an AI, trained on simple reward signal, to be "manipulative" from the very beginning. And all that simply by following the gradient of its incentives. And so without ever having a moment where it thinks "and now, I shall start lying!", or any equivalent within its mind.

In short, there won't be any "sordid stumble" or "moment of vulnerability" where humans are able to spot blatant lies, because that's not what failure looks like.

[1] It would of course be disastrous if each time there was an alarm we would restart and tweak the AI until the alarm stopped sounding.

