The question of ‘value alignment’ centres upon how to ensure that AI systems are properly aligned with human values. It can be broken down into two parts. The first part is technical and focuses on how to encode values or principles in artificial agents, so that they reliably do what they ought to do. The second part is normative, and focuses on what values or principles it would be right to encode in AI.

This paper focuses on the second question, paying particular attention to the fact that we live in a pluralistic world where people have a variety of different beliefs about value. Ultimately, I suggest that we need to devise principles for alignment that treat people fairly and command widespread support despite this difference of opinion.

Moral considerations

Any new technology generates moral considerations. Yet the task of imbuing artificial agents with moral values becomes particularly important as computer systems operate with greater autonomy and at a speed that ‘increasingly prohibits humans from evaluating whether each action is performed in a responsible or ethical manner’.

The first part of the paper notes that while technologists have an important role to play in building systems that respect and embody human values, the task of selecting appropriate values is not one that can be settled by technical work alone. This becomes clear when we look at the different ways in which value alignment could be achieved, at least within the reinforcement learning paradigm.

One set of approaches tries to specify a reward function for an agent that would lead it to promote the right kind of outcome and act in ways that are broadly thought to be ethical. For this approach to succeed, we need to specify appropriate goals for artificial agents and encode them in AI systems – which is far from straightforward. A second family of approaches proceeds differently. Instead of trying to specify the correct reward function for the agent up front, it looks at ways in which an agent could learn the correct reward from examples of human behaviour or human feedback. However, the question then becomes what data or feedback to train the agent on – and how this decision can be justified.
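The contrast between the two families of approaches can be made concrete with a toy sketch. Everything below is illustrative rather than drawn from the paper: the two-feature state, the weight of 10.0, and the use of a linear reward fitted with a Bradley–Terry preference model are all assumptions chosen for simplicity. The point is only that the normative question resurfaces in both cases – as a hand-chosen weight in the first, and as the choice of feedback data in the second.

```python
import math

# Approach 1: specify the reward function by hand.
# The designer decides, up front, what counts as a good outcome.
def specified_reward(state):
    # Hypothetical toy state: (task_progress, harm_caused).
    progress, harm = state
    # The weight 10.0 is a normative choice, not a technical one.
    return progress - 10.0 * harm

# Approach 2: learn a reward function from human feedback.
# Here the feedback is pairwise comparisons ("state a is better than
# state b"), fitted with a Bradley-Terry model by gradient ascent.
def learn_reward(comparisons, steps=2000, lr=0.1):
    w = [0.0, 0.0]  # linear reward r(s) = w . s, starting from zero
    for _ in range(steps):
        for better, worse in comparisons:
            # Modelled probability that the human prefers `better`.
            diff = sum(wi * (b - c) for wi, b, c in zip(w, better, worse))
            p = 1.0 / (1.0 + math.exp(-diff))
            # Log-likelihood gradient nudges w toward the preferred state.
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * (better[i] - worse[i])
    return w

# The feedback data encodes the trainers' values: they prefer progress
# and disprefer harm. A different dataset would yield a different reward.
feedback = [((1.0, 0.0), (0.0, 0.0)),   # progress preferred to nothing
            ((0.0, 0.0), (0.0, 1.0)),   # nothing preferred to harm
            ((1.0, 0.0), (1.0, 1.0))]   # harmless progress preferred

w = learn_reward(feedback)
def learned_reward(state):
    return sum(wi * si for wi, si in zip(w, state))
```

In the first case the normative judgement is explicit in the code; in the second it is implicit in whose comparisons make up `feedback` – which is exactly the justificatory question the paragraph above raises.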

Either way, important normative questions remain.



Alignment with what?

A key concern among AI researchers is that the systems they build are properly responsive to human direction and control. Indeed, as Stuart Russell notes, it is important that artificial agents understand the real meaning of the instructions they are given, and that they do not interpret them in an excessively literal way – with the story of King Midas serving as a cautionary tale.

At the same time, there is growing recognition that AI systems may need to go beyond this – and be designed in a way that leads them to do the right thing by default, even in the absence of direct instructions from a human operator.

One promising approach holds that AI should be designed to align with human preferences. In this way, AI systems would learn to avoid outcomes that very few people want. However, this approach has certain weaknesses. People's revealed preferences can be irrational or based on false information. They may also be malicious. Furthermore, preferences are sometimes ‘adaptive’: people who lead lives affected by poverty or discrimination may revise their hopes and expectations downwards in order to avoid disappointment. By aligning itself with existing human preferences, AI could therefore come to act on data that is heavily compromised.

To address this weakness, I suggest that AI systems need to be properly responsive to underlying human interests and values. A principle-based approach to AI alignment, which takes both of these into account, would yield agents that are less likely to do harm and more likely to promote human well-being. A principle-based approach to alignment could also be sensitive to other considerations, such as the welfare of future generations, non-human animals and the environment.

Three approaches

The final part of the paper looks at the ways in which principles for AI alignment might be identified.



In this context, I suggest that the main challenge is not to identify ‘true’ moral principles and encode them in AI – for even if we came to have great confidence in the truth of a single moral theory, there would still be people with different beliefs and opinions who disagreed with us. Instead, we should try to identify principles for alignment that are acceptable to people who subscribe to a wide range of reasonable points of view. Principles of this kind could be arrived at in at least three different ways.