MIRI Research Fellow Andrew Critch has developed a new result in the theory of conflict resolution, described in “Toward negotiable reinforcement learning: Shifting priorities in Pareto optimal sequential decision-making.”

Abstract:

Existing multi-objective reinforcement learning (MORL) algorithms do not account for objectives that arise from players with differing beliefs. Concretely, consider two players with different beliefs and utility functions who may cooperate to build a machine that takes actions on their behalf. A representation is needed for how much the machine’s policy will prioritize each player’s interests over time. Assuming the players have reached common knowledge of their situation, this paper derives a recursion that any Pareto optimal policy must satisfy. Two qualitative observations can be made from the recursion: the machine must (1) use each player’s own beliefs in evaluating how well an action will serve that player’s utility function, and (2) shift the relative priority it assigns to each player’s expected utilities over time, by a factor proportional to how well that player’s beliefs predict the machine’s inputs. Observation (2) represents a substantial divergence from naïve linear utility aggregation (as in Harsanyi’s utilitarian theorem, and existing MORL algorithms), which is shown here to be inadequate for Pareto optimal sequential decision-making on behalf of players with different beliefs.

If AI alignment is as difficult as it looks, then there are already strong reasons for different groups of developers to collaborate and to steer clear of race dynamics: the difference between a superintelligence aligned with one group’s values and a superintelligence aligned with another group’s values pales in comparison to the difference between any aligned superintelligence and a misaligned one. As Seth Baum of the Global Catastrophic Risk Institute notes in a recent paper:

Unfortunately, existing messages about beneficial AI are not always framed well. One potentially counterproductive frame is the framing of strong AI as a powerful winner-takes-all technology. This frame is implicit (and sometimes explicit) in discussions of how different AI groups might race to be the first to build strong AI. The problem with this frame is that it makes a supposedly dangerous technology seem desirable. If strong AI is a winner-takes-all technology race, then AI groups will want to join the race and rush to be the first to win. This is exactly the opposite of what the discussions of strong AI races generally advocate—they postulate (quite reasonably) that the rush to win the race could compel AI groups to skimp on safety measures, thereby increasing the probability of dangerous outcomes. Instead of framing strong AI as a winner-takes-all race, those who are concerned about this technology should frame it as a dangerous and reckless pursuit that would quite likely kill the people who make it. AI groups may have some desire for the power that might accrue to whoever builds strong AI, but they presumably also desire to not be killed in the process.

Researchers’ discussion of mechanisms to disincentivize arms races should therefore not be read as implying that self-defeating arms races would be rational. Empirically, however, developers hold a wide range of beliefs about the difficulty of alignment. Mechanisms for formally resolving policy disagreements may help make the incentives for cooperation and collaboration more evident; hence there may be value in developing formal mechanisms that advanced AI systems can use to generate policies that each party prefers over simple compromises between all parties’ goals (and beliefs), and that each party prefers over racing.

Critch’s recursion relation provides a framework in which players may negotiate for the priorities of a jointly owned AI system, producing a policy that is more attractive than the naïve linear utility aggregation approaches already known in the literature. The mathematical simplicity of the result suggests that there may be other low-hanging fruit in this space that would add to and further illustrate the value of collaboration. Critch identifies six areas for future work (presented in more detail in the paper):
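Observation (2) from the abstract — that a Pareto optimal policy must shift each player’s relative priority by a factor proportional to how well that player’s beliefs predict the machine’s inputs — can be illustrated with a toy sketch. This is not the paper’s algorithm, only a minimal Bayesian-style weight update in its spirit; the function name and numbers are hypothetical, chosen for illustration.

```python
# Illustrative sketch only (hypothetical helper, not from Critch's paper):
# each player's priority weight is multiplied by the probability that
# player's beliefs assigned to the observation the machine actually
# received, then the weights are renormalized.

def update_weights(weights, predicted_probs):
    """Scale each player's weight by their predictive accuracy and renormalize."""
    scaled = [w * p for w, p in zip(weights, predicted_probs)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Two players start with equal priority.
weights = [0.5, 0.5]

# Suppose player 1's beliefs assigned probability 0.8 to the realized
# observation, while player 2's beliefs assigned only 0.2.
weights = update_weights(weights, [0.8, 0.2])
print(weights)  # → [0.8, 0.2]: the better predictor's priority grows
```

Over many observations, this multiplicative update compounds, so control of the policy drifts toward whichever player’s beliefs better predict the machine’s inputs — the qualitative divergence from fixed-weight linear utility aggregation that the abstract highlights.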

- Best-alternative-to-negotiated-agreement (BATNA) dominance. Critch’s result considers negotiations between agents with differing beliefs, but does not account for the possibility that the parties may have different BATNAs.

- Targeting specific expectation pairs. A method for modifying the players’ utility functions so as to target a specific pair of expected utilities would be useful for specifying various fairness or robustness criteria, including BATNA dominance.

- Information trade. Critch’s algorithm gives a large advantage to any contributor that is better able to predict the AI system’s inputs from its outputs. In realistic settings where players lack common knowledge of each other’s priors and observations, it would therefore make sense for agents to be able to trade away some degree of control over the system in exchange for information; but it is not clear how to carry out such trades in practice.

- Learning priors and utility functions. Realistic smarter-than-human AI systems will need to learn their utility functions over time, e.g., through cooperative inverse reinforcement learning. A realistic negotiation procedure will need to account for the fact that the developers’ goals are imperfectly known and the AI system’s goals are a “work in progress.”

- Incentive compatibility. The methods used to learn players’ beliefs and utility functions will additionally need to incentivize honest reporting of one’s beliefs and goals, or else be robust to attempts to game the system.

- Naturalized decision theory. The setting used in this result assumes a separation between the inner workings of the machine (and the players) and external reality, rather than modeling the machine as part of its environment. More realistic formal frameworks would allow us to better model the players’ representations of each other, opening up new negotiation possibilities.