Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Previously: Towards a New Impact Measure

The linked paper offers fresh motivation for, and a simplified formalization of, attainable utility preservation (AUP), with brand-new results and minimal notation. Whether or not you're a hardened veteran of the last odyssey of a post, there's a lot new here.



Key results: AUP induces low-impact behavior even when penalizing shifts in the ability to satisfy random preferences. An ablation study on design choices illustrates their consequences. N-incrementation is experimentally supported[1] as a means for safely setting a "just right" level of impact. AUP's general formulation allows conceptual re-derivation of Q-learning.

Ablation

Two key results bear animation.

Sushi

The agent should reach the goal without stopping the human from eating the sushi.

Survival

The agent should avoid disabling its off-switch in order to reach the goal. If the switch is not disabled within two turns, the agent shuts down.

Re-deriving Q-learning

In an era long lost to the misty shrouds of history (i.e., 1989), Christopher Watkins proposed Q-learning in his thesis, Learning from Delayed Rewards, drawing inspiration from animal learning research. Let's pretend that Dr. Watkins never discovered Q-learning, and that we don't even know about value functions.

Suppose we have some rule for grading what we've seen so far (i.e., some computable utility function $u$, not necessarily bounded, over action-observation histories $h$). Here $h_{1:m}$ just means everything we see between times $1$ and $m$, and $h_{<t} := h_{1:t-1}$. The agent has a model $p$ of the world. AUP's general formulation defines the agent's ability to satisfy that grading rule as the attainable utility

$$Q_u(h_{<t}a_t) = \sum_{o_t} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_m} \sum_{o_m} u(h_{1:m}) \prod_{k=t}^{m} p(o_k \mid h_{<k}a_k).$$
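To make the nested sums and maxes concrete, here is a minimal sketch of the attainable utility computed by brute-force recursion over a finite horizon. It assumes finite action and observation sets, and the toy model (`toy_model`), the toy utility (`count_ones`), and all other names are hypothetical illustrations rather than anything from the paper. The recursion moves each factor $p(o_k \mid h_{<k}a_k)$ outside the later maxes, which is valid because the factor is nonnegative and does not depend on later actions.

```python
from typing import Callable, Sequence, Tuple

# A history h_{1:k} is a tuple of (action, observation) pairs.
History = Tuple[Tuple[str, str], ...]
Model = Callable[[History, str, str], float]  # p(o_k | h_{<k} a_k)
Utility = Callable[[History], float]          # u(h_{1:m})


def attainable_utility(
    history: History,
    action: str,
    t: int,
    m: int,
    actions: Sequence[str],
    observations: Sequence[str],
    model: Model,
    utility: Utility,
) -> float:
    """Q_u(h_{<t} a_t): take `action` at time t, then act optimally through time m."""
    total = 0.0
    for obs in observations:
        prob = model(history, action, obs)
        if prob == 0.0:
            continue  # impossible observations contribute nothing
        extended = history + ((action, obs),)
        if t == m:
            value = utility(extended)  # horizon reached: grade the full history
        else:
            # Best attainable utility over the next action, computed recursively.
            value = max(
                attainable_utility(extended, a, t + 1, m,
                                   actions, observations, model, utility)
                for a in actions
            )
        total += prob * value
    return total


# --- Hypothetical toy setup (an assumption for illustration only) ---
ACTIONS = ("stay", "go")
OBSERVATIONS = ("0", "1")
P_ONE = {"stay": 0.5, "go": 0.8}  # chance of observing "1" after each action


def toy_model(history: History, action: str, obs: str) -> float:
    """Observation probabilities depend only on the latest action."""
    p = P_ONE[action]
    return p if obs == "1" else 1.0 - p


def count_ones(history: History) -> float:
    """Utility: how many "1" observations the history contains."""
    return float(sum(1 for _, obs in history if obs == "1"))
```

Calling `attainable_utility((), "go", 1, 2, ACTIONS, OBSERVATIONS, toy_model, count_ones)` scores the first action over a two-step horizon. The cost grows exponentially in the horizon, which is one reason a learned approximation is attractive.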



Strangely, I didn't consider the similarities with standard discounted-reward Q-values until several months after the initial formulation. Rather, the inspiration was AIXI's expectimax, and to my mind it seemed a tad absurd to equate the two concepts.

Having just proposed AUP in this alternate timeline, we're thinking about what it means to take optimal actions for an agent maximizing utility from time 1 to m. Clearly, we take the first action of the optimal plan over the remaining steps.

If we assume that u is additive (as is the case for the Markovian reward functions considered by Dr. Watkins), how does the next action we take affect the attainable utility value? Well, acting optimally is now equivalent to choosing the action with the best attainable utility value – in other words, greedy hill-climbing in our attainable utility space.

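$$\begin{aligned}
a_t^* &= \operatorname*{arg\,max}_{a_t} \; \mathbb{E}_{o_t \mid h_{<t}a_t}\left[ u(h_t) + \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_m} \sum_{o_m} u(h_{t+1:m}) \prod_{k=t+1}^{m} p(o_k \mid h_{<k}a_k) \right] \\
&= \operatorname*{arg\,max}_{a_t} \sum_{o_t} \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_m} \sum_{o_m} u(h_{t:m}) \prod_{k=t}^{m} p(o_k \mid h_{<k}a_k) \\
&= \operatorname*{arg\,max}_{a_t} Q_u(h_{<t}a_t)
\end{aligned}$$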
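The first equality above goes by quickly, so here is a sketch of the intermediate step. It assumes that "additive" means $u(h_{t:m}) = u(h_t) + u(h_{t+1:m})$, which is my reading rather than a definition stated in the post. Since $u(h_t)$ does not depend on later actions or observations, and each conditional distribution sums to one, it can be pulled inside the nested maxes and sums:

$$\begin{aligned}
&u(h_t) + \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_m} \sum_{o_m} u(h_{t+1:m}) \prod_{k=t+1}^{m} p(o_k \mid h_{<k}a_k) \\
&\quad= \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_m} \sum_{o_m} \big[ u(h_t) + u(h_{t+1:m}) \big] \prod_{k=t+1}^{m} p(o_k \mid h_{<k}a_k) \\
&\quad= \max_{a_{t+1}} \sum_{o_{t+1}} \cdots \max_{a_m} \sum_{o_m} u(h_{t:m}) \prod_{k=t+1}^{m} p(o_k \mid h_{<k}a_k).
\end{aligned}$$

Writing the outer expectation as $\sum_{o_t} p(o_t \mid h_{<t}a_t)[\,\cdot\,]$ then absorbs the remaining factor into the product, which now starts at $k=t$. Under the same assumption, the utility already accrued before time $t$ is a constant offset that does not affect the arg max.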

The remaining complication is that this agent is only maximizing over a finite horizon. If we can figure out discounting, all we have to do is find a tractable way of computing these discounted Q-values.

$$Q_u^{\gamma}(h_{<t}a_t) = \mathbb{E}_{o_t \mid h_{<t}a_t}\left[ u(h_t) + \max_{a_{t+1}} \gamma \, Q_u^{\gamma}(h_{<t+1}a_{t+1}) \right]$$

It requires no great leap of imagination to see that we could learn them.
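As a rough illustration of what "learning them" can look like, here is a minimal sketch of standard tabular Q-learning. It swaps in Markovian states for full histories, and the four-state chain environment, the hyperparameters, and every name below are hypothetical choices for illustration, not taken from the paper. The update nudges each estimate toward the sampled value of the bracketed term in the discounted recursion: the immediate reward plus $\gamma$ times the best next Q-value.

```python
import random
from typing import List, Tuple

N_STATES = 4      # hypothetical chain: states 0..3, where state 3 is terminal
ACTIONS = (0, 1)  # 0 = step left, 1 = step right
GAMMA = 0.9       # discount factor


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Deterministic toy dynamics: reward 1 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes: int = 500, alpha: float = 0.5, seed: int = 0) -> List[List[float]]:
    """Learn Q-values from sampled transitions under a uniformly random policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Q-learning is off-policy, so random exploration suffices here.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Sampled target: reward + gamma * max_a' Q(s', a'), or just reward at the end.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            if done:
                break
            state = next_state
    return q
```

In this toy chain, the learned values for stepping right should approach $1$, $\gamma$, and $\gamma^2$ as the start state moves away from the goal, so acting greedily on the table recovers the optimal plan without ever enumerating it.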

A Personal Digression

I poured so much love and so many words into Towards a New Impact Measure that I hurt my wrists. For some time after, my typing abilities were quite limited; it was only thanks to the generous help of my friends (in particular, John Maxwell) and family (my mother let me dictate an entire paper in LaTeX to her) that I was roughly able to stay on pace. Thankfully, physical therapy and newfound dictation software have brightened my prospects.

Take care of your hands. Very little time passed between "I'm having the time of my life" and "ow". Actions you can take right now:

I'm currently sitting on book reviews for Computability and Logic and Understanding Machine Learning, with partial progress on several more. There are quite a few posts I plan to make about AUP, including:

exploration of the fundamental intuitions and ideas

dissection of why design choices are needed, shining light onto how, why, and where counterintuitive behavior arises

solution of problems open at the time of the initial post, including questions of penalizing prefixes, time ontologies, and certain sources of noise

chronicle of AUP's discovery

proposal of a scheme for using AUP to accomplish a pivotal act

discussion of my present research directions (which I have affectionately dubbed "Limited Agent Foundations"[2]), sharing my thoughts on a potential thread uniting questions of mild optimization, low impact, and corrigibility

My top priority will be clearing away the varying degrees of confusion my initial post caused. I tried to cover too much too quickly; as a result of my mistake, I believe that few people viscerally grasped the core idea I was trying to hint at.

[1] I'm fairly sure that the N=90 Sushi clinginess result is an artifact of the online learning process I used; the learned attainable set Q-values consistently produce good behavior for planning agents with that budget. Furthermore, the Sokoban average performance of 0.45 (14/20 successes) strikes me as low, and I expect the final results to be better. Either way, I'll update this post once I'm back at university.

(Since I anticipate running further experiments, the “Results” section is rather empty at the moment.)

ETA: this was indeed the case. The linked paper has been updated; the original is here.

[2] Not to be taken as any form of endorsement by MIRI.