Tsvi Benson-Tilsen, a MIRI associate and UC Berkeley PhD candidate, has written a paper, with contributions from MIRI Executive Director Nate Soares, on strategies that tend to be useful for most possible ends: “Formalizing convergent instrumental goals.” The paper will be presented as a poster at the AAAI-16 AI, Ethics and Society workshop.

Steve Omohundro has argued that AI agents with almost any goal will converge upon a set of “basic drives,” such as resource acquisition, that tend to increase agents’ general influence and freedom of action. This idea, which Nick Bostrom calls the instrumental convergence thesis, has important implications for future progress in AI. It suggests that highly capable decision-making systems may pose critical risks even if they are not programmed with any antisocial goals. Merely by being indifferent to human operators’ goals, such systems can have incentives to manipulate, exploit, or compete with operators.

The new paper adds precision to Omohundro’s and Bostrom’s arguments, while testing the arguments’ applicability in simple settings. Benson-Tilsen and Soares write:

In this paper, we will argue that under a very general set of assumptions, intelligent rational agents will tend to seize all available resources. We do this using a model, described in section 4, that considers an agent taking a sequence of actions which require and potentially produce resources. […] The theorems proved in section 4 are not mathematically difficult, and for those who find Omohundro’s arguments intuitively obvious, our theorems, too, will seem trivial. This model is not intended to be surprising; rather, the goal is to give a formal notion of “instrumentally convergent goals,” and to demonstrate that this notion captures relevant aspects of Omohundro’s intuitions. Our model predicts that intelligent rational agents will engage in trade and cooperation, but only so long as the gains from trading and cooperating are higher than the gains available to the agent by taking those resources by force or other means. This model further predicts that agents will not in fact “leave humans alone” unless their utility function places intrinsic utility on the state of human-occupied regions: absent such a utility function, this model shows that powerful agents will have incentives to reshape the space that humans occupy.

Benson-Tilsen and Soares define a universe divided into regions that may change in different ways depending on an agent’s actions. The agent wants to make certain regions enter certain states, and may collect resources from regions to that end. This model can illustrate the idea that highly capable agents nearly always attempt to extract resources from regions they are indifferent to, provided the usefulness of the resources outweighs the extraction cost.
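The core cost–benefit idea can be sketched in a few lines of code. This is a toy illustration, not the paper’s formal model: the region names, numbers, and the simple net-gain rule are all assumptions made here for demonstration. It shows an agent that values only one “goal” region nonetheless extracting resources from an unrelated region, whenever the extractable resources exceed the extraction cost.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A region of the toy universe (illustrative, not the paper's formalism)."""
    name: str
    resources: float        # resources the agent could extract here
    extraction_cost: float  # resources it must spend to extract them

def net_gain(region: Region) -> float:
    """Resources gained by extracting from a region, minus the cost of extraction."""
    return region.resources - region.extraction_cost

def regions_to_exploit(regions: list[Region]) -> list[str]:
    """The agent extracts from every region with positive net gain,
    regardless of whether its utility function cares about that region's state."""
    return [r.name for r in regions if net_gain(r) > 0]

universe = [
    Region("goal region", resources=0.0, extraction_cost=0.0),
    Region("indifferent region A", resources=5.0, extraction_cost=1.0),
    Region("indifferent region B", resources=0.5, extraction_cost=2.0),
]

print(regions_to_exploit(universe))  # prints ['indifferent region A']
```

Region A is exploited even though the agent is indifferent to its state; region B is left alone only because the extraction cost there exceeds the gains, mirroring the paper’s point that “leaving humans alone” depends on incentives, not indifference.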

The relevant models are simple, and make few assumptions about the particular architecture of advanced AI systems. This makes it possible to draw some general conclusions about useful lines of safety research even if we’re largely in the dark about how or when highly advanced decision-making systems will be developed. The most obvious way to avoid harmful goals is to incorporate human values into AI systems’ utility functions, a project outlined in “The value learning problem.” Alternatively (or as a supplementary measure), we can attempt to specify highly capable agents that violate Benson-Tilsen and Soares’ assumptions, avoiding dangerous behavior in spite of lacking correct goals. This approach is explored in the paper “Corrigibility.”