This post is longer and more self-contained than my recent stubs.

tl;dr: Patches such as telling the AI "avoid X" will result in Goodhart's law and the nearest unblocked strategy problem: the AI will do almost exactly what it was going to do, except narrowly avoiding the specific X.

However, if the patch can be replaced with "I am telling you to avoid X", and this is treated as information about what to avoid, and the biases and narrowness of my reasoning are correctly taken into account, these problems can be avoided. The important thing is to correctly model my uncertainty and overconfidence.

AIs don't have a Goodhart problem, not exactly

The problem of an AI maximising a proxy utility function seems similar to the Goodhart Law problem, but isn't exactly the same thing.

The standard Goodhart law is a principal-agent problem: the principal P and the agent A both know, roughly, what the principal's utility U is (eg U aims to create a successful company). However, fulfilling U is difficult to measure, so a measurable proxy V is used instead (eg V aims to maximise share price). Note that the principal's and the agent's goals are misaligned, and the measurable V serves to (try to) bring them more into alignment.

For an AI, the problem is not that U is hard to measure, but that it is hard to define. And the AI's goals are V: there is no need to make V measurable, it is not a check on the AI, but the AI's intrinsic motivation.

This may seem like a small difference, but it has large consequences. We could give an AI a V, our "best guess" at U, while also including all our uncertainty about how to define U. This option is not available for the principal-agent problem, since giving a complicated goal to a more knowledgeable agent just gives it more opportunities to misbehave: we can't rely on it maximising the goal, we have to check that it does so.

Overfitting to the patches

There is a certain similarity with many machine learning techniques. Neural nets that distinguish cats and dogs could treat any "dog" photo as a specific patch that can be routed around. In that case, the net would define "dog" as "anything almost identical to the dog photos I've been trained on", and "cat" as "anything else".

And that would be a terrible design; fortunately, modern machine learning gets around the problem by, in effect, assigning uncertainty correctly: "dog" is not seen as the exact set of dog photos in the training set, but as a larger, more nebulous concept, of which the specific dog photos are just examples.

Similarly, we could define V as W+Δ, where W is our best attempt at specifying U, and Δ encodes the fact that W is but an example our imperfect minds have come up with, to try and capture U. We know that W is oversimplified, and Δ is an encoding of this fact. If a neural net could synthesise a decent estimate of "dog" from some examples, could it synthesise "friendliness" from our attempts to define it?

The idea is best explained through an example.

Example: Don't crush the baby or the other objects

This section will present a better example, I believe, than the original one presented here.

A robot exists in a grid world:

The robot's aim is to get to the goal square, with the flag. It gets a penalty of −1 for each turn it isn't there.

If that were the only reward, the robot's actions would be disastrous:

So we will give it a penalty of −100 for running over babies. If we do so, we will get a Goodhart/nearest unblocked strategy behaviour:

Oops! Turns out we valued those vases as well.

What we want the AI to learn is not that the baby is specifically important, but that the baby is an example of important things it should not crush. So imagine it is confronted by the following, which includes six types of objects, of unknown value:

Instead of having humans hand-label each item, we generalise from some hand-labelled examples, using rules of extrapolation and some machine learning. This tells the AI that, typically, we value about one-in-six objects, and value them at a tenth of the value of babies (hence it gets −10 for running one over). Given that, the best policy, with an expected reward of −9−10(2/6)≈−12.333…, is:

This behaviour is already much better than we would expect from a typical Goodhart law-style agent (and we could complicate the example to make the difference more emphatic).
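The expected reward quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function name and parameters are my own labels, and the reading of −9−10(2/6) as "nine turns, two unknown objects crushed, each valued with probability one-in-six" is an assumption taken from the expression itself.

```python
from fractions import Fraction


def expected_reward(turns, unknown_objects_crushed, p_valued, penalty=10):
    """Expected reward of a route.

    -1 for each turn spent away from the goal, minus `penalty` for each
    crushed object that we turn out to value (probability `p_valued` each).
    """
    return -turns - penalty * p_valued * unknown_objects_crushed


# Assumed reading of the example: 9 turns, 2 unknown objects crushed,
# one-in-six objects valued, penalty of 10 per valued object.
reward = expected_reward(turns=9, unknown_objects_crushed=2, p_valued=Fraction(1, 6))
print(reward, float(reward))
```

Using exact fractions keeps the result as −37/3, which is the −12.333… in the text.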

Example: human over-confidence

The above works if we humans correctly account for our uncertainty - if we not only produce W, but also a correct Δ for how good a match we expect between W and U.

But we humans are often overconfident in our estimates, especially in our estimates of value. We are far better at hindsight ("you shouldn't have crushed the vase") than at foresight ("here's a complete list of what you shouldn't do"). Even knowing that hindsight is better doesn't make the issue go away.

This is similar to the planning fallacy. That fallacy means that we underestimate the time taken to complete tasks - even if we try to take the planning fallacy into account.

However, the planning fallacy can be solved using the outside view: comparing the project to similar projects, rather than using detailed inner knowledge.

Similarly, human overconfidence can be solved by the AI noting our initial estimates, our corrections to those initial estimates, our corrections taking into account the previous corrections, our attempts to take into account all previous repeated corrections - and the failure of those attempts.

Suppose, for example, that humans, in hindsight, value one-in-three of the typical objects in the grid world. We start out with an estimate of one-in-twelve; after the robot mashes a few too many of the objects, we update to one-in-nine; after being repeatedly told that we underestimate our hindsight, we update to one-in-six... and stay there.

But meanwhile, the robot can still see that we continue to underestimate, and goes directly to a one-in-three estimate; so with new, unknown objects, it will only risk crushing a single one:

If the robot learnt that we valued even more objects (or valued some of them more than +10), it would then default to the safest, longest route:


In practice, of course, the robot will also be getting information about what types of objects we value, but the general lesson still applies: the robot can learn that we underestimate uncertainty, and increase its own uncertainty in consequence.
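To make the route choices in this section concrete, here is a small sketch of how the preferred route shifts as the robot's estimate of "proportion of objects we value" rises. Only the short route (nine turns, two unknown objects) comes from the expected reward quoted earlier; the lengths of the other two routes, and all the names below, are hypothetical numbers chosen so that the behaviour matches the description.

```python
from fractions import Fraction

# (name, turns taken, unknown objects crushed).
# "short" matches the -9 - 10(2/6) example; "medium" and "long" are
# assumed values, not taken from the original grid.
ROUTES = [
    ("short", 9, 2),
    ("medium", 11, 1),
    ("long", 15, 0),
]


def expected_reward(turns, crushed, p_valued, penalty=10):
    """-1 per turn, minus `penalty` per crushed object that we value."""
    return -turns - penalty * p_valued * crushed


def best_route(p_valued, penalty=10):
    """Name of the route with the highest expected reward."""
    return max(
        ROUTES,
        key=lambda route: expected_reward(route[1], route[2], p_valued, penalty),
    )[0]


# Our stated estimates (1/12, 1/9, 1/6), the hindsight rate (1/3),
# and a still higher rate (1/2).
for p in (Fraction(1, 12), Fraction(1, 9), Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)):
    print(p, best_route(p))
```

Under these assumed route lengths, every one of our stated estimates leaves the robot on the short route, the one-in-three hindsight rate moves it to the route that risks a single object, and a higher rate (or a larger penalty) moves it to the longest route.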

Full uncertainty, very unknown unknowns

So, this is a more formal version of ideas I posted a while back. The process could be seen as:

1. Give the AI W as our current best estimate for U.
2. Encode our known uncertainties about how well W relates to U.
3. Have the AI deduce, from our subsequent behaviour, how well we have encoded our uncertainties, and change these as needed.
4. Repeat 2-3 for different types of uncertainties.
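As a rough illustration of the third step, the AI could compare each estimate we stated with what our hindsight judgements actually show. The sketch below uses the one-in-twelve, one-in-nine and one-in-six estimates from the earlier example; the record of hindsight judgements and the function names are hypothetical, and a real system would need a far richer model than a single ratio.

```python
from fractions import Fraction


def hindsight_rate(judgements):
    """Share of encountered objects that we, in hindsight, said we valued."""
    return Fraction(sum(judgements), len(judgements))


def underestimation_factor(stated_rate, judgements):
    """How many times larger the hindsight rate is than our stated estimate."""
    return hindsight_rate(judgements) / stated_rate


# Hypothetical record: 12 objects encountered, 4 valued in hindsight (one-in-three).
judgements = [True, False, False] * 4

# Our successive stated estimates from the example above.
stated_estimates = [Fraction(1, 12), Fraction(1, 9), Fraction(1, 6)]

factors = [underestimation_factor(s, judgements) for s in stated_estimates]
print(factors)
```

The factors shrink from 4 to 3 to 2 but never reach 1: our corrections help, yet we keep underestimating. An AI that notices this pattern has grounds to widen its own uncertainty beyond whatever we state next.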

What do I mean by "different types" of uncertainty? Well, the example above was simple: the model had but a single uncertainty, over the proportion of typical objects that we valued. The AI learnt that we systematically underestimated this, even when it helped us try and do better.

But there are other types of uncertainties that could happen. We value some objects more than others, but maybe these estimates are not accurate either. Maybe we are fine as long as one object of a type exists, and don't care about the others - or, conversely, maybe some objects are only valuable in pairs. The AI needs a rich enough model to be able to account for these extra types of preferences that we may not have ever articulated explicitly.

There are even more examples as we move from gridworlds into the real world. We can articulate ideas like "human value is fragile" and maybe give an estimate of the total complexity of human values. And then the agent could use examples to estimate the quality of our estimate, and come up with a better number for the desired complexity.

But "human value is fragile" is a relatively recent insight. There was a time when people hadn't articulated that idea. So it's not that we didn't have a good estimate for the complexity of human values; we didn't have any idea that it was a good thing to estimate.

The AI has to figure out the unknown unknowns. Note that, unlike the value synthesis project, the AI doesn't need to resolve this uncertainty; it just needs to know that it exists, and give a good-enough estimate of it.

The AI will certainly figure out some unknown unknowns (and unknown knowns): it just has to spot some patterns and connections we were unaware of. But in order to get all of them, the AI has to have some sort of maximal model in which all our uncertainty (and all our models) can be contained.

Just consider some of the concepts I've come up with (I chose these because I'm most familiar with them; LessWrong abounds with other examples): siren worlds, humans making similar normative assumptions about each other, and the web of connotations.

In theory, each of these should have reduced my uncertainty, and moved W closer to U. In practice, each of these has increased my estimate of uncertainty, by showing how much remains to be done. Could an AI have taken these effects correctly into account, given that these three examples are of very different types? Can it do so for discoveries that remain to be made?

I've argued that an indescribable hellworld cannot exist. There's a similar question as to whether there exists human uncertainty about U that cannot be included in the AI's model of Δ. By definition, this uncertainty would be something that is currently unknown and unimaginable to us. However, I feel that it's far more likely to exist than the indescribable hellworld.

Still, despite that issue, it seems to me that there are methods of dealing with the Goodhart problem/nearest unblocked strategy problem. And this involves properly accounting for all our uncertainty, directly or indirectly. If we do this well, there no longer remains a Goodhart problem at all.