Machine learning (ML) algorithms can already recognize patterns in many tasks far better than the humans they’re working for. This allows them to generate predictions and make decisions in a variety of high-stakes situations. For example, electricians use IBM Watson’s predictive capabilities to anticipate clients’ needs; Uber’s self-driving system determines what route will get passengers to their destination the fastest; and Insilico Medicine leverages its drug discovery engine to identify avenues for new pharmaceuticals.

As data-driven learning systems continue to advance, it would be easy enough to define “success” according to technical improvements, such as increasing the amount of data algorithms can synthesize and, thereby, improving the efficacy of their pattern identifications. However, for ML systems to truly be successful, they need to understand human values. More to the point, they need to be able to weigh our competing desires and demands, understand what outcomes we value most, and act accordingly.

Understanding Values

In order to highlight the kinds of ethical decisions that our ML systems are already contending with, Kaj Sotala, a researcher in Finland working for the Foundational Research Institute, turns to traffic analysis and self-driving cars. Should a toll road be used in order to shave five minutes off the commute, or would it be better to take the longer route in order to save money?

Answering that question is not as easy as it may seem.

For example, Person A may prefer to take a toll road that costs five dollars if it will save five minutes, but they may not want to take the toll road if it costs them ten dollars. Person B, on the other hand, might always prefer taking the shortest route regardless of price, as they value their time above all else.

In this situation, Sotala notes that we are ultimately asking the ML system to determine what humans value more: Time or money. Consequently, what seems like a simple question about what road to take quickly becomes a complex analysis of competing values. “Someone might think, ‘Well, driving directions are just about efficiency. I’ll let the AI system tell me the best way of doing it.’ But another person might feel that there is some value in having a different approach,” he said.
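The Person A and Person B comparison can be made concrete with a small sketch. It assumes, purely for illustration, that each person's preference reduces to a single number for how many dollars a minute of travel is worth; the route times, tolls, and the `dollars_per_minute` values are hypothetical and do not come from Sotala's work.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    minutes: float
    toll_dollars: float


def route_cost(route, dollars_per_minute):
    """Perceived cost: the toll plus travel time converted into dollars."""
    return route.toll_dollars + route.minutes * dollars_per_minute


def choose_route(routes, dollars_per_minute):
    """Pick the route with the lowest perceived cost for this person."""
    return min(routes, key=lambda r: route_cost(r, dollars_per_minute))


# Hypothetical routes: the toll road saves five minutes.
free_road = Route("free road", minutes=30, toll_dollars=0.0)
cheap_toll = Route("toll road", minutes=25, toll_dollars=5.0)
pricey_toll = Route("toll road", minutes=25, toll_dollars=10.0)

# Hypothetical valuations of time, in dollars per minute.
PERSON_A = 1.5    # pays $5 to save five minutes, but not $10
PERSON_B = 100.0  # values time above all else

print(choose_route([free_road, cheap_toll], PERSON_A).name)
print(choose_route([free_road, pricey_toll], PERSON_A).name)
print(choose_route([free_road, pricey_toll], PERSON_B).name)
```

The same routing code produces different answers for different people only because the value placed on time was entered by hand, which is the manual step the article describes.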

While it’s true that ML systems have to weigh our values and make tradeoffs in all of their decisions, Sotala notes that this isn’t a problem at the present juncture. The tasks that the systems are dealing with are simple enough that researchers are able to manually enter the necessary value information. However, as AI agents increase in complexity, Sotala explains that they will need to be able to account for and weigh our values on their own.

Understanding Utility-Based Agents

When it comes to incorporating values, Sotala notes that the problem comes down to how intelligent agents make decisions. A thermostat, for example, is a type of reflex agent. It knows when to start heating a house because of a set, predetermined temperature: the thermostat turns the heating system on when the temperature falls below a certain threshold and turns it off when the temperature rises above a certain threshold. Goal-based agents, on the other hand, make decisions based on achieving specific goals. For example, an agent whose goal is to buy everything on a shopping list will continue its search until it has found every item.
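As a rough illustration of the difference, the two agent types might be sketched as follows. The thresholds, store names, and function names are hypothetical, chosen only to mirror the thermostat and shopping-list examples.

```python
def thermostat(temperature, heating_on, low=19.0, high=21.0):
    """Reflex agent: maps the current reading directly to an action."""
    if temperature < low:
        return True       # too cold: turn heating on
    if temperature > high:
        return False      # too warm: turn heating off
    return heating_on     # inside the band: leave the heating as it is


def shop(shopping_list, stores):
    """Goal-based agent: keeps visiting stores until every item is found."""
    remaining = set(shopping_list)
    visited = []
    for store, stock in stores.items():
        if not remaining:
            break         # goal reached, stop searching
        found = remaining & stock
        if found:
            visited.append(store)
            remaining -= found
    return visited, remaining


# Hypothetical stores and their stock.
stores = {
    "grocery": {"milk", "bread"},
    "shoe store": {"shoes"},
    "bakery": {"bread"},
}
print(thermostat(18.0, heating_on=False))
print(shop(["milk", "shoes"], stores))
```

The reflex agent never looks beyond the current reading, while the goal-based agent tracks what is still missing and stops once the goal is met. Neither has any way to say that one item matters more than another.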

Utility-based agents are a step above goal-based agents. They can deal with tradeoffs like the following: “Getting milk is more important than getting new shoes today. However, I’m closer to the shoe store than the grocery store, and both stores are about to close. I’m more likely to get the shoes in time than the milk.” At each decision point, utility-based agents are presented with a number of options that they must choose from. Every option is associated with a specific “utility” or reward. To reach their goal, the agents follow the decision path that will maximize the total rewards.

From a technical standpoint, utility-based agents rely on “utility functions” to make decisions. These are formulas that the systems use to synthesize data, balance variables, and maximize rewards. Ultimately, the decision path that gives the most rewards is the one that the systems are taught to select in order to complete their tasks.
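One common way to write such a utility function is as an expected utility: the value of an outcome multiplied by the probability of achieving it. The sketch below applies that idea to the milk-versus-shoes tradeoff. The utilities and probabilities are invented for illustration, and expected utility is an assumption here, not a formula taken from Sotala's work.

```python
def expected_utility(option):
    """Value of the outcome weighted by the chance of reaching the store in time."""
    return option["utility"] * option["p_success"]


def choose(options):
    """Utility-based agent: pick the option with the highest expected utility."""
    return max(options, key=expected_utility)


# Hypothetical numbers: milk matters more, but the shoe store is closer.
options = [
    {"name": "milk", "utility": 10.0, "p_success": 0.3},
    {"name": "shoes", "utility": 6.0, "p_success": 0.9},
]

print(choose(options)["name"])
```

With these numbers the agent heads for the shoe store, because the higher chance of success outweighs the lower importance. The weights themselves are fixed inputs, so the agent cannot notice when the person's priorities change.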

While these utility programs excel at finding patterns and responding to rewards, Sotala asserts that current utility-based agents assume a fixed set of priorities. As a result, these methods are insufficient when it comes to future AGI systems, which will be acting autonomously and so will need a more sophisticated understanding of when humans’ values change and shift.

For example, a person may always value taking the longer route to avoid a highway and save money, but not if they are having a heart attack and trying to get to an emergency room. How is an AI agent supposed to anticipate and understand when our values of time and money change? This issue is further complicated because, as Sotala points out, humans often value things independently of whether they have ongoing, tangible rewards. Sometimes humans even value things that may, in some respects, cause harm. Consider an adult who values privacy but whose doctor or therapist may need access to intimate and deeply personal information — information that may be lifesaving. Should the AI agent reveal the private information or not?

Ultimately, Sotala explains that utility-based agents are too simple and don’t get to the root of human behavior. “Utility functions describe behavior rather than the causes of behavior … they are more of a descriptive model, assuming we already know roughly what the person is choosing.” While a descriptive model might recognize that passengers prefer saving money, it won’t understand why, and so it won’t be able to anticipate or determine when other values override “saving money.”

An AI Agent Creates a Queen

At its core, Sotala emphasizes that the fundamental problem is ensuring that AI systems are able to uncover the models that govern our values. This will allow them to use these models to determine how to respond when confronted with new and unanticipated situations. As Sotala explains, “AIs will need to have models that allow them to roughly figure out our evaluations in totally novel situations, the kinds of value situations where humans might not have any idea in advance that such situations might show up.”

In some domains, AI systems have surprised humans by uncovering our models of the world without human input. As one early example, Sotala references research with “word embeddings” where an AI system was tasked with classifying sentences as valid or invalid. In order to complete this classification task, the system identified relationships between certain words. For example, as the AI agent noticed a male/female dimension to words, it created a relationship that allowed it to get from “king” to “queen” and vice versa.
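The “king” to “queen” relationship is usually demonstrated with vector arithmetic on the learned embeddings. The sketch below shows the mechanics on hand-written toy vectors. Real embeddings are learned from text and have hundreds of dimensions; the three dimensions here (roughly royalty, gender, and "fruit-ness") and all of the numbers are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy, hand-written embeddings: (royalty, gender, fruit-ness).
vectors = {
    "king":  [0.9,  0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

print(analogy("king", "man", "woman", vectors))
print(analogy("queen", "woman", "man", vectors))
```

Subtracting "man" and adding "woman" moves the vector along the gender dimension while leaving royalty untouched, which is why the nearest remaining word is "queen", and the reverse step leads back to "king".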

Since then, there have been systems which have learned more complex models and associations. For example, OpenAI’s recent GPT-2 system has been trained to read some writing and then write the kind of text that might follow it. When given a prompt of “For today’s homework assignment, please describe the reasons for the US Civil War,” it writes something that resembles a high school essay about the US Civil War. When given a prompt of “Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry,” it writes what sounds like Lord of the Rings-inspired fanfiction, including names such as Aragorn, Gandalf, and Rivendell in its output.

Sotala notes that in both cases, the AI agent “made no attempt of learning like a human would, but it tried to carry out its task using whatever method worked, and it turned out that it constructed a representation pretty similar to how humans understand the world.”

There are obvious benefits to AI systems that are able to automatically learn better ways of representing data and, in so doing, develop models that correspond to humans’ values. When humans can’t determine how to map, and subsequently model, values, AI systems could identify patterns and create appropriate models by themselves. However, the opposite could also happen — an AI agent could construct something that seems like an accurate model of human associations and values but is, in reality, dangerously misaligned.

For instance, suppose an AI agent learns that humans want to be happy, and in an attempt to maximize human happiness, it hooks our brains up to computers that provide electrical stimuli that give us feelings of constant joy. In this case, the system understands that humans value happiness, but it does not have an appropriate model of how happiness corresponds to other competing values like freedom. “In one sense, it’s making us happy and removing all suffering, but at the same time, people would feel that ‘no, that’s not what I meant when I said the AI should make us happy,’” Sotala noted.

Consequently, we can’t rely on an agent’s ability to uncover a pattern and create an accurate model of human values from this pattern. Researchers need to be able to model human values, and model them accurately, for AI systems.

Developing a Better Definition

Given our competing needs and preferences, it’s difficult to model the values of any one person. Combining and agreeing on values that apply universally to all humans, and then successfully modeling them for AI systems, seems like an impossible task. However, several solutions have been proposed, such as inverse reinforcement learning or attempting to extrapolate the future of humanity’s moral development. Yet, Sotala notes that these solutions fall short. As he articulated in a recent paper, “none of these proposals have yet offered a satisfactory definition of what exactly human values are, which is a serious shortcoming for any attempts to build an AI system that was intended to learn those values.”

In order to solve this problem, Sotala developed an alternative, preliminary definition of human values, one that might be used to design a value learning agent. In his paper, Sotala argues that values should be defined not as static concepts, but as variables that are considered separately and independently across a number of situations in which humans change, grow, and receive “rewards.”

Sotala asserts that our preferences may ultimately be better understood in terms of evolutionary theory and reinforcement learning. To justify this reasoning, he explains that, over the course of human history, people evolved to pursue activities that are likely to lead to certain outcomes, namely outcomes that tended to improve our ancestors’ fitness. Today, he notes that humans still prefer those outcomes, even if they no longer maximize our fitness. In this respect, over time, we also learn to enjoy and desire mental states that seem likely to lead to high-reward states, even if they do not.

So instead of a particular value directly mapping onto a reward, our preferences map onto our expectation of rewards.
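A minimal sketch of this idea, assuming a simple running-average update of the kind used in basic reinforcement learning, is shown below. The update rule, the activity names, and the reward numbers are illustrative assumptions and are not taken from Sotala's paper.

```python
def update(expected, state, reward, learning_rate=0.5):
    """Move the expected reward for `state` part of the way toward what was observed."""
    old = expected.get(state, 0.0)
    expected[state] = old + learning_rate * (reward - old)
    return expected


def preferred(expected):
    """The preference follows the learned expectation, not the latest reward."""
    return max(expected, key=expected.get)


expected = {}
# Hypothetical history: hiking has been rewarding, television mildly so.
for _ in range(3):
    update(expected, "hiking", 1.0)
    update(expected, "television", 0.4)

# One unrewarding hike lowers the expectation but does not flip the preference.
update(expected, "hiking", 0.0)
still_preferred = preferred(expected)

# A second unrewarding hike does: new experience changes what seems rewarding.
update(expected, "hiking", 0.0)
now_preferred = preferred(expected)

print(still_preferred, now_preferred)
```

In this toy model the agent keeps preferring hiking after a hike that paid nothing, because the preference tracks the accumulated expectation. Only repeated new experience shifts it.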

Sotala claims that the definition is useful when attempting to program human values into machines, as value learning systems informed by this model of human psychology would understand that new experiences can change which states a person’s brain categorizes as “likely to lead to reward.” Summing up Sotala’s work, the Machine Intelligence Research Institute outlined the benefits of this framing. “Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone,” they said.

This form of modeling values, Sotala admits, is not perfect. First, the paper is only a preliminary stab at defining human values, which still leaves a lot of details open for future research. Researchers still need to answer empirical questions related to things like how values evolve and change over time. And once all the empirical questions are answered, researchers need to contend with the philosophical questions that don’t have an objective answer, like how those values should be interpreted and how they should guide an AGI’s decision-making.

When addressing these philosophical questions, Sotala notes that the path forward may simply be to get as much of a consensus as possible. “I tend to feel that there isn’t really any true fact of which values are correct and what would be the correct way of combining them,” he explains. “Rather than trying to find an objectively correct way of doing this, we should strive to find a way that as many people as possible could agree on.”

Since publishing this paper, Sotala has been working on a different approach for modeling human values, one that is based on the premise of viewing humans as multiagent systems. This approach has been published as a series of Less Wrong articles. There is also a related, but separate, research agenda by Future of Humanity Institute’s Stuart Armstrong, which focuses on synthesizing human preferences into a more sophisticated utility function.