Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We analyze the usefulness of the framework of preference types [Berridge et al. 2009] for value learning by an artificial intelligence. In the context of AI, the purpose of value learning is to give an AI goals aligned with humanity's. We lay the groundwork for establishing how human preferences of different types are (descriptively) or ought to be (normatively) aggregated.

This blogpost (1) describes the framework of preference types and how these can be inferred, (2) considers how an AI could aggregate our preferences, and (3) suggests how to choose an aggregation method. Lastly, we consider potential future directions that could be taken in this area.

Motivation

The reason that the concept of multiple preference types is useful for AI is that people often have internal conflicts. Examples of internal conflicts:

You are on a diet and prefer not to eat fattening food, but still really enjoy it.

Addicts that no longer enjoy their drug.

A friend who says he doesn’t want your help, but does.

In conversation with someone racist, you may have conflicting goals. On one hand, you want them to change their mind and not be racist. On the other hand, you want to make them feel bad about their racism. Unfortunately, if you act on the latter, you are unlikely to achieve the former.

Anything you would like to do, but have trouble staying engaged with; for example, you may enjoy exercise or learning a language, and have a goal of getting fit or being fluent, but not be motivated to go for a run or open a textbook.

We think that these internal conflicts can be understood as us having preferences of different types that compete with one another. If an AI ignores the fact that we can have competing preferences, then when it considers a state it will only infer our preference for that state based on one proxy, which will often leave the AI with an incomplete picture. Examples in which taking into account only one data source for preferences leads to complications:

In the case that the AI only takes into account what humans say they want, there is the problem of people lying, leaving out details (because they haven’t thought of them), or not knowing their preferences.

In the case of inverse reinforcement learning (IRL), the AI infers preferences from behavior only. In the diet example, if people 'have little willpower', an AI using IRL may incorrectly infer that they prefer to eat as much sugar as possible.

Anticipated application of the approach:

Identifying and aggregating different preference types could help with the value learning problem, where value learning is “making AI’s goals more accurately aligned with human goals”. It could help with value learning in one of the following ways:

Indirectly, via a descriptive model: By understanding humans better (through understanding how they aggregate their preferences) the AI could better form goals that align with how humans behave.

Directly, via a normative model: By understanding how preferences of different types, and different data sources on human preferences, should be aggregated, the AI could better form goals that align with human goals.

Some very concrete examples of where our approach would be used are:

A personal assistant robot has access to several sources of information regarding what its employer prefers. It has to make sense of conflicting signals, for example when its employer is on a diet, but still really enjoys sugary foods.

Much focus has been on making education more enjoyable, but research suggests that in order to help students achieve their goals it is important to appeal to their motivational systems, which can be separate from their enjoyment.

Framework: Liking, Wanting and Approving

In this post we focus on three specific preference types that we think are valid and distinct from each other. We work with the preference framework of liking, wanting and approving, which are defined as follows:

Liking is experienced pleasure. In the brain, liking is mediated by the opioid system (endorphins).

Wanting is what triggers you to act: it makes you seek something out, and it always precedes action. We use 'wanting' here in the sense of incentive salience [Berridge et al. 2014], as opposed to its everyday sense; for the everyday sense we use the word 'desire'. Physiologically, wanting is regulated by dopamine.

Approving has to do with reasoning and rationalizing: it is what you think you should do. Approving of a behaviour means viewing it as in line with one's self-image ("ego-syntonic") or as conducive to achieving one's goals.

The following examples are inspired by this blogpost [Alexander 2011].

+liking/+wanting/+approving: Experiencing love.

+liking/+wanting/-approving: Eating chocolate.

+liking/-wanting/+approving: Many hobbies (they are enjoyable, and although people approve of them, they can rarely bring themselves to do them).

+liking/-wanting/-approving: Eating foie gras.

-liking/+wanting/+approving: Running just before the runner’s high.

-liking/+wanting/-approving: Addicts that no longer enjoy their drug.

-liking/-wanting/+approving: Working.

-liking/-wanting/-approving: Torture.

There are several motivations for our choice of the framework of liking, wanting and approving. Firstly, for each combination of positive or negative liking, wanting and approving there is an example that fits, which suggests the three are independent. Secondly, we (humans) already use data on body language, stated preferences and actions to form our theory of mind of other people. Lastly, there is a large body of research on these preference types, which makes them easier to work with.

Relations between preference types:

These preferences are distinct [Berridge et al. 2009], but can influence one another [van Gelder 1998]. For example, it may in general cause you more pleasure to do something you approve of. We would like to see a comprehensive, descriptive model of preference types in humans in cognitive science. These interactions could for example be modeled as a dynamical system [van Gelder 1998].

The observables:

Liking, wanting and approving are for the most part hidden processes. They are not directly observable, but they influence observable behaviors. As a proxy for liking we propose to use facial expressions, body language or responses to questionnaires. Although a brain scan may be the most accurate proxy for liking, there is evidence both that facial expressions and body language are indicators of pleasure and pain [Algom et al. 1994] and that they can be classified well enough to make them technically feasible proxies [Giorgana et al. 2012]. The observable proxy of wanting is revealed preferences. We propose to encode a proxy for approval via stated preferences.

Extracting reward functions from the data sources:

In order to aggregate (however we choose to do that), we need to make sure the preferences are encoded in commensurable ways. To make our approach attuned to reinforcement learning we propose to encode the preferences as reward functions. For this we need to collect human data in a specific environment with well-defined states, in order to ensure all three sets of data refer to (different) preference types about the same state, and then normalise.

Liking: Have people act in the environment and record and classify their facial expressions. This mapping from states to classified facial expressions can then be reduced to a function from states to real numbers.

Wanting: Have people act in the environment and apply inverse reinforcement learning to infer a revealed preferences reward function.

Approving: Have a list of states and ask people to attach a real number to that state to signify how much they approve of the state.
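All three extraction pipelines above end in the same target format: a mapping from states to real numbers. A minimal sketch of that shared interface, with the upstream steps (expression classifier, IRL) reduced to precomputed tables and all numbers purely illustrative:

```python
# Common target format: each preference type becomes a state -> reward mapping.
# The upstream stages (facial-expression classifier, IRL) are hypothetical
# stand-ins here, reduced to tables; only their input/output types matter.
STATES = ["s1", "s2", "s3"]

# Liking: e.g. average classified valence of facial expressions per state.
liking = {"s1": 0.8, "s2": 0.1, "s3": 0.1}      # illustrative numbers

# Wanting: reward function inferred by IRL from observed behavior.
wanting = {"s1": 0.9, "s2": -0.9, "s3": 0.0}    # illustrative numbers

# Approving: stated per-state approval scores from a questionnaire.
approving = {"s1": -0.5, "s2": 0.6, "s3": 0.6}  # illustrative numbers

reward_functions = {"liking": liking, "wanting": wanting, "approving": approving}
```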

Examples of collecting data:

Personal assistant: Define states of the living room.

Record facial expressions of the human as the person goes about their life.

Observe behavior and use IRL to infer a wanting reward function.

Ask the human how much they approve of each state of the living room.

Taxi-driver dataset: Define states of taxi-drivers.

Record facial expressions of taxi-drivers as they do their job.

Observe behavior and use IRL to infer a wanting reward function. IRL is already being applied to taxi-driver data.

Ask drivers how much they approve of each state.

The reward functions extracted should be renormalized, to make them commensurable.
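A minimal sketch of one renormalization choice (min-max scaling to [0, 1]; other affine rescalings would serve equally well):

```python
def normalize(R):
    """Affinely rescale a state -> reward mapping to [0, 1] so that reward
    functions extracted from different proxies become commensurable."""
    lo, hi = min(R.values()), max(R.values())
    if hi == lo:  # constant reward function: no preference information
        return {s: 0.0 for s in R}
    return {s: (r - lo) / (hi - lo) for s, r in R.items()}

wanting = {"s1": 20, "s2": -20, "s3": 0}
print(normalize(wanting))  # {'s1': 1.0, 's2': 0.0, 's3': 0.5}
```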

We have considered other preference types as well, such as disgust, serotonin, oxytocin, rationalizing and remembered preferences, see this doc.

Aggregating Preferences

Our initial approach to establishing a method for aggregation of preference types was to find desiderata any potential aggregation function should comply with.

As a source of desiderata, we examined existing bodies of research that dealt with aggregating preferences, either across individuals or between different types. We looked at the following fields and corresponding desiderata:

Economics & Social Welfare Theory: Examining how these fields maximised utility across the preferences of different individuals highlighted Pareto-efficiency and inequality-aversion.

Social Choice Theory: Arrow's impossibility theorem gives three "fairness criteria" for an electoral system: non-dictatorship, universality and independence of irrelevant alternatives.

Constitutional Law: The idea of a constitution gives a veto-right to resilient or pre-set preferences over transitory ones.

Moral Philosophy: Using utilitarianism as analogous to providing a utility function, we considered the respective limitations of hedonistic utilitarianism (analogous to liking) and preference utilitarianism (analogous to approval). Rather than provide additional desiderata, this would influence how those identified are prioritised by the end-user.

To illustrate what we mean by desiderata for aggregation and by aggregation methods, and how these could be used with the preference types framework, consider the following examples.

Desiderata for aggregating:

Order-preserving, Pareto-efficiency, or unanimity: For any states s1 and s2: if R_L(s1) > R_L(s2), R_W(s1) > R_W(s2) and R_A(s1) > R_A(s2), then R(s1) > R(s2).

Veto-right: The approval function, and/or a set of values considered the constitution, can veto the aggregate. The spirit of non-dictatorship and that of veto-right are opposed; however, they are not mutually exclusive as long as the veto-right does not always apply.

Inequality-aversion: The different functions should be kept within a certain range of each other.

Borrowing desiderata from these fields is useful for initial inspiration, but the analogy does not generalize to an in-depth analysis.
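On a finite state space, the order-preserving desideratum can be checked directly for any candidate aggregate. A brute-force sketch:

```python
from itertools import permutations

def satisfies_unanimity(R, R_L, R_W, R_A, states):
    """Return True iff whenever all three type rewards strictly prefer
    s1 over s2, the aggregate R also prefers s1 over s2."""
    for s1, s2 in permutations(states, 2):
        if R_L[s1] > R_L[s2] and R_W[s1] > R_W[s2] and R_A[s1] > R_A[s2]:
            if not R[s1] > R[s2]:
                return False
    return True

# Toy check: an aggregate that inverts a unanimous preference fails.
R_L = R_W = R_A = {"a": 1, "b": 0}
assert satisfies_unanimity({"a": 1, "b": 0}, R_L, R_W, R_A, ["a", "b"])
assert not satisfies_unanimity({"a": 0, "b": 1}, R_L, R_W, R_A, ["a", "b"])
```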

Aggregation methods:

We now consider some specific aggregation methods and see whether they satisfy the desiderata.

Setting the aggregate R to R_W gives a descriptive model of how humans act. A normative model can be obtained by setting the aggregate to R_A. Both of these aggregates are order-preserving and have a veto property, but lack inequality aversion.

Defining the aggregate for each state as R(s) = −(100 − R_L(s))^2 − (100 − R_W(s))^2 − (100 − R_A(s))^2 yields a function that is order-preserving (as long as rewards stay below 100), does not have a veto property, but satisfies inequality aversion.

First taking the minimum of the three functions until a threshold is reached, then taking R_A until a higher threshold is reached, and only once that threshold is reached taking the average, gives an aggregate that satisfies all three desiderata.

Example:

Consider the situation where Bob is on a diet, but still has a sweet tooth. He can be in the following states {s1 = on a diet, saw sugar and ate sugar, s2 = on a diet, saw sugar and did not eat sugar, s3 = on a diet, did not see sugar and did not eat it}.

Bob has the following reward functions:

Liking reward function: R_L(s1) = 5, R_L(s2) = 0, R_L(s3) = 0

Wanting reward function: R_W(s1) = 20, R_W(s2) = -20, R_W(s3) = 0

Approving reward function: R_A(s1) = -10, R_A(s2) = 5, R_A(s3) = 5

Define aggregate functions:

R1(s) = R_W(s)

R2(s) = −(100 − R_L(s))^2 − (100 − R_W(s))^2 − (100 − R_A(s))^2

R3(s) equals min(R_L(s), R_W(s), R_A(s)) while that minimum is below the threshold -5; otherwise R_A(s) while it is below the threshold 0; otherwise the average of the three.

Then s1 maximizes R1 and R2, and s3 maximizes R3.
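Bob's example can be checked mechanically. A sketch, using one reading of R3's threshold rule (the minimum applies while it is below -5, then R_A while it is below 0, then the average):

```python
# Bob's reward functions over s1 (ate sugar), s2 (resisted), s3 (no sugar seen).
R_L = {"s1": 5, "s2": 0, "s3": 0}      # liking
R_W = {"s1": 20, "s2": -20, "s3": 0}   # wanting
R_A = {"s1": -10, "s2": 5, "s3": 5}    # approving
STATES = ["s1", "s2", "s3"]

def R1(s):  # pure wanting: a descriptive aggregate
    return R_W[s]

def R2(s):  # quadratic, inequality-averse aggregate
    return -(100 - R_L[s])**2 - (100 - R_W[s])**2 - (100 - R_A[s])**2

def R3(s):  # threshold aggregate (one interpretation of the rule)
    m = min(R_L[s], R_W[s], R_A[s])
    if m < -5:       # the worst-off preference type vetoes
        return m
    if R_A[s] < 0:   # then approval vetoes
        return R_A[s]
    return (R_L[s] + R_W[s] + R_A[s]) / 3

print(max(STATES, key=R1), max(STATES, key=R2), max(STATES, key=R3))  # s1 s1 s3
```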

Choosing an Aggregation Method

As the choice of aggregation method will depend on the particular scenario, it should be determined on a case-by-case basis.

Some useful approaches would be:

Asking people for their meta-preferences, i.e. their preferences regarding how their reward functions should be aggregated between preference types.

Importance of desiderata to the end-user, also based on meta-preferences e.g. the school of ethics the end-user adheres to.

Letting the accuracy of measurement of a separate reward function decide how much it should be weighted.

Implementing a sensible aggregation function, letting an AI act in a relatively safe environment, and changing the aggregation function as desired. This is similar to, but more substantial than, coming up with mock input for the AI and simulating what actions it would take.

Identifying a more complete preference type, which may be difficult but not impossible to collect data on. Data on this preference type can be used to fit an aggregation function. An example of a more complete preference type is satisfaction in hindsight. Another example is how close a mental state is to the state of flow. In a way, identifying more complete preferences poses the question: which preference gets to govern the others, that is, which one does the person most strongly identify with?

For a more general approach to the problem, future work could:

Create an ideal descriptive function that perfectly models how humans actually aggregate preferences.

Create a limited set of normative functions that could be easily applied by researchers with different priorities.

Provide guidelines and resources for drafting new aggregation functions that use preference types.

Final Remarks

Applicability:

This approach is useful, even with an unsophisticated aggregation method:

Adding more proxies for true happiness lowers the impact of Goodhart’s law.

If there is only one proxy, then it is difficult for the AI to know when it is wrong about the measurement of the proxy. Adding in more proxies allows the AI to classify all measurements where the proxies are in conflict as potentially wrong.
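Once the proxies share a common scale, conflict detection reduces to flagging states where they disagree strongly. A sketch (the disagreement threshold is an assumption):

```python
def conflicted_states(R_L, R_W, R_A, states, tol=0.5):
    """Flag states where the normalized proxies disagree by more than tol;
    measurements there should be treated as potentially wrong."""
    def spread(s):
        vals = (R_L[s], R_W[s], R_A[s])
        return max(vals) - min(vals)
    return [s for s in states if spread(s) > tol]

R_L = {"s1": 0.9, "s2": 0.5}
R_W = {"s1": 0.1, "s2": 0.5}
R_A = {"s1": 0.2, "s2": 0.6}
print(conflicted_states(R_L, R_W, R_A, ["s1", "s2"]))  # ['s1']
```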

Some actions the AI could take in the case of conflicting preferences are:

Asking the human for confirmation, to refine error detection in measuring any of the preference types.

Helping the human with introspection, for example through clarification or debate.

Future work:

Helpful next steps for any researchers that would like to take on the project would seem to be:

Small-scale questionnaire-based data collection, to properly try out the model: extract reward functions, aggregate them, have a reinforcement learner optimize for the aggregate, and see whether the emergent behavior is desired.

A literature review of how existing research in cognitive science treats the interaction between preference types. This would help with forming a descriptive aggregation model.

Other directions:

Some cognitive biases suggest that we should discount for different preference types differently. For example, diversification bias [Read et al. 1999] indicates individuals prioritise ‘liking’ over a shorter timeframe and ‘approving’ over a longer one.

If we want to allow the separate reward functions to have different discounting factors, then they can not be aggregated into one reward function, unless we include time as a state-feature.
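A sketch of the time-as-state-feature trick: with per-type discount factors (the values below are hypothetical, and the aggregate is assumed to be a plain sum of the discounted type rewards), the reward on the augmented state (s, t) absorbs each type's discounting, leaving an undiscounted aggregate:

```python
# Hypothetical per-type discount factors gamma_L, gamma_W, gamma_A.
GAMMAS = {"liking": 0.9, "wanting": 0.99, "approving": 0.999}

def augmented_reward(type_rewards, t):
    """type_rewards: preference type -> R_type(s) for the current state s.
    Returns the reward assigned to the augmented state (s, t): each type's
    contribution is pre-discounted by its own gamma raised to the time t."""
    return sum(GAMMAS[k] ** t * r for k, r in type_rewards.items())

# At t = 2, a pure-liking reward of 1.0 is worth 0.9**2 = 0.81.
print(augmented_reward({"liking": 1.0, "wanting": 0.0, "approving": 0.0}, 2))
```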

Previous work [Baum 2017] has been done on aggregating preferences between different individuals. The preference types framework has the potential to enhance this by modelling how preferences of certain types in others influence our own preferences by type. To better understand these interactions, we can simulate them in a simplified model and observe the emergent behavior. We've conducted some initial work on this; please contact us if you are interested.

References

Kent C. Berridge, Terry E. Robinson, and J. Wayne Aldridge, Dissecting components of reward: 'liking', 'wanting', and learning, Curr Opin Pharmacol. Feb; 9(1): 65–73, 2009.

Kent C. Berridge, John P. O’Doherty, Experienced Utility to Decision Utility, in Neuroeconomics (Second Edition), 2014.

Scott Alexander, Approving Reinforces Low-Effort Behaviours, https://www.lesswrong.com/posts/yDRX2fdkm3HqfTpav/approving-reinforces-low-effort-behaviors, 2011.

Tim van Gelder, The Dynamical Hypothesis in Cognitive Science, Behav Brain Sci. Oct; 21(5):615-28; discussion 629-65, 1998.

Daniel Algom, Sonia Lubel, Psychophysics in the field: Perception and memory for labor pain, Perception & Psychophysics 55: 133. https://doi.org/10.3758/BF03211661, 1994.

Geovanny Giorgana, Paul G. Ploeger, Facial expression recognition for domestic service robots, in RoboCup 2011: Robot Soccer World Cup XV, pp. 353-364, 2012.

Daniel Read, George Loewenstein, Shobana Kalyanaraman, Mixing virtue and vice: combining the immediacy effect and the diversification heuristic, Journal of Behavioral Decision Making; Dec; 12, 4; ABI/INFORM Global pg. 257, 1999.

Seth Baum, Social Choice Ethics in Artificial Intelligence, AI & Society (forthcoming), DOI 10.1007/s00146-017-0760-1, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3046725, 2017.