New paper: The Incentives that Shape Behaviour

How causal models can describe an agent’s incentives.

Ryan Carey and Eric Langlois, introducing The Incentives that Shape Behaviour.

Machine learning algorithms are often highly effective, but it can be difficult to establish their safety and fairness. Typically, the properties of a machine learning system are established by testing. However, even if a system behaves safely in a testing environment, it may behave unsafely or unfairly when it is deployed. Alternatively, the properties of a model can be investigated by analysing input perturbations, individual decisions, or network activations, but this is often difficult, time-consuming, and highly demanding of expertise.

Rather than examining or testing individual models, our alternative approach is to look at whether a given training environment incentivises unsafe or unfair decisions.

This approach is not entirely new — incentives are an intuitive and pertinent object of discussion. For example, see Stuart Russell’s discussion of the incentives of content recommendation systems below. (Other examples include Remark 1 in Hadfield-Menell et al. and The Basic AI Drives by Steve Omohundro.)


“What is the purpose put into the [social media recommendation] machine? Feed people stuff they want to click on, because that’s how we make money. Well how do you maximize clickthrough — you just send people stuff they like clicking on, right? Simple as that. Actually, that’s not what the algorithms are doing… That’s not how reinforcement learning works. Reinforcement learning changes the state of the world to maximize the reward. The state of the world in this case is your brain… [so] it changes you in a way that makes you more predictable so that it can then send you stuff that it knows you’re going to click on.” — Stuart Russell

The pressure to modify user behaviour can be viewed as an undesired incentive. Incentive-based arguments like this one are powerful, in that they apply independently of system architecture. Yet most previous work on incentives has focused on specific problems, which makes the insights hard to transfer to new problems and situations. In our recent work, we have begun to develop a general, causal theory of incentives, which allows us to state and formulate solutions to many kinds of fairness and safety problems in a unified framework.

In our theory, an incentive, roughly, is something an agent must do to best achieve its goals. We consider two types of incentives: A control incentive is present when the agent must control some component of its environment in order to maximise its utility (such as the “user opinions” in the social media recommendation example above). A response incentive is present when the agent’s decision must be causally responsive to some component of its environment — for example, a locomoting robot should attend to the positions of obstacles when navigating rough terrain.

Control Incentives

Examples

To make incentive analysis formal, we can use causal influence diagrams. A causal influence diagram represents a decision problem by breaking it down into a graph, where each variable depends on the values of its parents (X is a parent of Y if there is an arrow X → Y). It consists of three types of node: decision nodes, representing the choices available to the agent; utility nodes, representing the agent’s objective; and chance nodes, representing all other aspects of the environment.

For example, Stuart Russell’s social media manipulation example can be represented with the following influence diagram.

Control incentive for user opinion

In this model, a recommender algorithm selects a series of posts to show a user in order to maximise the number of posts that the user clicks on. If we consider the user’s response to each post to be an independent event, then content that the user appreciates will receive more clicks. However, the posts also have an indirect effect. If the user views many polarising articles then they might adopt some of those views and become more predictable in terms of what they’ll click on. This may allow the algorithm to achieve a higher clickthrough rate on the later posts in the series, which means there is a control incentive on Influenced user opinions.
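As a concrete illustration, the diagram can be encoded as a small directed graph and the control-incentive criterion checked mechanically. This is our own sketch in plain Python, with illustrative node names, not code from the paper:

```python
# The recommender influence diagram as a directed graph (names illustrative).
# Edges point from parent to child; "original_opinions" -> "posts" is the
# information link the recommender observes.
edges = {
    "original_opinions": ["posts", "influenced_opinions"],  # chance node
    "posts": ["influenced_opinions", "clicks"],             # decision node
    "influenced_opinions": ["clicks"],                      # chance node
    "clicks": [],                                           # utility node
}
decision, utility = "posts", "clicks"

def descendants(graph, node):
    """Every node reachable from `node` along directed edges."""
    seen, stack = set(), list(graph[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph[n])
    return seen

def has_control_incentive(graph, decision, utility, x):
    """True when X lies on a directed path decision -> ... -> X -> ... -> utility."""
    return x in descendants(graph, decision) and (
        x == utility or utility in descendants(graph, x)
    )

print(has_control_incentive(edges, decision, utility, "influenced_opinions"))  # True
print(has_control_incentive(edges, decision, utility, "original_opinions"))    # False
```

The check matches the discussion: Influenced user opinions sits on a directed path from the decision to the utility, so a control incentive arises on it, while Original user opinions is not downstream of the decision at all.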

In order to alleviate Stuart Russell’s concern (while preserving the function of the system), we want to remove the control incentive on user opinion, while preserving the incentive on clicks. We could redesign the system so that instead of being rewarded for the true click rate, it is rewarded for the predicted clicks on posts based on a model of the original user opinions. An agent trained in this way would view any modification of user opinions as irrelevant for improving its performance.

To work in practice, the click prediction must not itself include the effect of user opinion modification. We might accomplish this by using a prediction model that assumes independence between posts, or one that is learned by only showing one post to each user. This speaks to an important consideration when reasoning about incentives: the lack of an incentive on a variable (such as clicks) is only practically meaningful if no other variable (such as predicted clicks) acts as a “proxy” for it. Otherwise, a control incentive on predicted clicks might systematically induce the same kinds of decisions that a control incentive on clicks would induce, even if there is no control incentive on clicks. In future work, we plan to analyse what hidden incentives proxy variables can give rise to.

No control incentive for user opinion
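To see the effect of the redesign concretely, here is a sketch in the same spirit (plain Python, illustrative node names, our own encoding rather than the paper’s): the utility is now the predicted clicks, computed from a model of the original user opinions, so influenced opinions no longer lead to the utility.

```python
# The redesigned system as a directed graph (names illustrative).
edges = {
    "original_opinions": ["posts", "influenced_opinions", "opinion_model"],
    "opinion_model": ["predicted_clicks"],
    "posts": ["influenced_opinions", "predicted_clicks"],   # decision node
    "influenced_opinions": [],  # no longer an ancestor of the utility
    "predicted_clicks": [],     # utility node
}
decision, utility = "posts", "predicted_clicks"

def descendants(graph, node):
    """Every node reachable from `node` along directed edges."""
    seen, stack = set(), list(graph[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph[n])
    return seen

def has_control_incentive(graph, decision, utility, x):
    """True when X lies on a directed path decision -> ... -> X -> ... -> utility."""
    return x in descendants(graph, decision) and (
        x == utility or utility in descendants(graph, x)
    )

print(has_control_incentive(edges, decision, utility, "influenced_opinions"))  # False
print(has_control_incentive(edges, decision, utility, "predicted_clicks"))     # True
```

The control incentive on influenced user opinions is gone, while a control incentive remains on predicted clicks — the proxy variable whose behaviour the redesign must get right.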

This example fits a recurring pattern relating control incentives with safety and performance: Some control incentives are necessary for good performance, but the wrong ones may lead a system to be unsafe. For example, AlphaGo works well because it has a control incentive to protect its stones (performance) but not to protect its servers (safety). Ensuring that the control incentives match the user’s preferences is a central problem in safe incentive design.

Defining control incentives

Now that we have the basic intuition of control incentives, we may consider how to define them. Suppose that there is some variable X (such as the user’s political views). We can consider the values X could attain if the AI system behaved differently. If setting X to any attainable value x (such as “left-winger”, “centrist”, or “right-winger”) changes the utility the agent can attain, then we say there is a control incentive on X. Under this definition, a control incentive may arise for any variable on a causal path from the decision to the utility.

Everitt et al. defined the related concept of intervention incentives. A variable faces an intervention incentive if utility can be gained by directly setting its value. (This is equivalent to the value of control being nonzero.) Intervention incentives are less predictive of agent behaviour than control incentives, because they do not consider what the agent is able to influence with its decisions — hence our paper’s title, “The Incentives that Shape Behaviour”.

Let’s return to our example to highlight the difference between these two incentives. All variables leading to utility have intervention incentives, but only those that are also downstream of the action have control incentives.

Response Incentives

Which events must an optimal decision be responsive to?

This question has important implications for both AI safety and fairness. For AI safety, if a variable represents a shutdown command, it is desirable for the AI system’s behaviour to respond to it. Such a response incentive is not sufficient for safety, but it is a good start. In contrast, if this incentive is absent, then optimal policies can easily be unsafe. It is similarly desirable to have a response incentive for human commands in general, and for a value-learning system to have a response incentive on human values.

It also has important implications for fairness. If a sensitive variable such as race or sexual orientation has a response incentive, then this indicates an incentive for trained algorithms to be counterfactually unfair. We show in our paper that if there is a response incentive on a sensitive attribute, then all optimal policies are counterfactually unfair with respect to that attribute. Our paper takes some steps toward defining unfair incentives: predominantly focusing on how to rule out the presence of unfair incentives in a given graph.

The desirability of a response incentive thus depends on the variable in question. For some variables, we want an AI system to respond to them in order to behave safely; for others, a system that responds to them is one we consider unfair.
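A rough structural check is possible here too, with a caveat: the paper’s full graphical criterion for response incentives is stricter, pruning information links that an optimal policy does not need. The sketch below (our own illustrative encoding, not the paper’s criterion) tests only the necessary condition that the decision is reachable from the variable — the decision cannot respond to a variable that cannot influence it.

```python
# The recommender influence diagram (names illustrative).
edges = {
    "original_opinions": ["posts", "influenced_opinions"],
    "posts": ["influenced_opinions", "clicks"],
    "influenced_opinions": ["clicks"],
    "clicks": [],
}
decision = "posts"

def descendants(graph, node):
    """Every node reachable from `node` along directed edges."""
    seen, stack = set(), list(graph[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph[n])
    return seen

def response_incentive_possible(graph, decision, x):
    """Necessary condition only: the decision is reachable from X."""
    return decision in descendants(graph, x)

print(response_incentive_possible(edges, decision, "original_opinions"))    # True
print(response_incentive_possible(edges, decision, "influenced_opinions"))  # False
```

The recommender’s posts can respond to the original user opinions it observes, but never to the influenced opinions, which are determined only after the posts are chosen.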

Applications, Limitations, and Next Steps

This theory is already demonstrating its value through its applications. In addition to the safety and fairness problems discussed, it has been applied to the analysis of AI boxing schemes and reward tampering problems (blog post). As the fairness example shows, the theory doesn’t require the agent to reason causally or have causal models — only that we, the designers, can reason causally about the agent’s behaviour.

In the long-term, our aspiration is that when researchers anticipate possible safety or fairness concerns, they use this theory to perform incentive analysis of their AI system. This would generally involve drawing a causal diagram of how various agent components can be fit together and forming a judgement about what incentives ought to be (or not be) present, before applying our graphical criteria to automatically discern which incentives there are. In a very optimistic case, incentive analysis would become a standard tool for establishing the trustworthiness of an AI system, similarly to how statistical methods are used for describing AI performance. But in the short-term, we expect it to take some work to use these methods, and so we are happy to provide advice where it is needed.

This theory is not yet complete, as it is currently restricted to a single decision made by a single agent. We are working on extending it to the multi-decision case, and ultimately, we would like it to handle multiple agents. The paper is available at:

R Carey, E Langlois, T Everitt & S Legg. The Incentives that Shape Behaviour (2020), SafeAI@AAAI.