Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (Julian Schrittwieser et al) (summarized by Nicholas): Up until now, model-free RL approaches have been state of the art in visually rich domains such as Atari, while model-based RL has excelled at games that require planning many steps ahead, such as Go, chess, and shogi. This paper attains state-of-the-art performance on Atari using a model-based approach, MuZero, and matches AlphaZero (AN #36) at Go, chess, and shogi while using less compute. Importantly, it does this without requiring any advance knowledge of the rules of the game.

MuZero's model has three components:

1. The representation function produces an initial internal state from all existing observations.

2. The dynamics function predicts the next internal state and immediate reward after taking an action in a given internal state.

3. The prediction function generates a policy and a value prediction from an internal state.

Although these are based on the structure of an MDP, the internal states of the model do not necessarily have any human-interpretable meaning. They are trained end-to-end only to accurately predict the policy, value function, and immediate reward. This model is then used to simulate trajectories for use in MCTS.
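To make the three components concrete, here is a minimal sketch of how they fit together when simulating a trajectory inside the model. Everything in it is an illustrative assumption rather than the paper's implementation: the names `ToyModel` and `unroll` are hypothetical, the three functions are hand-written stand-ins for trained neural networks, and the unroll follows a fixed action sequence instead of running MCTS.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# The internal state is opaque: nothing requires it to mean anything to a human.
State = Tuple[float, ...]


@dataclass
class ToyModel:
    represent: Callable[[Sequence[float]], State]          # observations -> initial state
    dynamics: Callable[[State, int], Tuple[State, float]]  # (state, action) -> (next state, reward)
    predict: Callable[[State], Tuple[List[float], float]]  # state -> (policy, value)


def unroll(model: ToyModel, observations: Sequence[float], actions: Sequence[int]):
    """Simulate a trajectory using only the model, never the real environment."""
    state = model.represent(observations)
    steps = []
    for action in actions:
        policy, value = model.predict(state)
        state, reward = model.dynamics(state, action)
        steps.append((action, reward, policy, value))
    return steps


# Hand-written stand-ins (assumptions); in MuZero these would be learned networks.
def toy_represent(observations):
    return (sum(observations),)


def toy_dynamics(state, action):
    return (state[0] + action,), float(action)


def toy_predict(state):
    return [0.5, 0.5], state[0]


toy_model = ToyModel(toy_represent, toy_dynamics, toy_predict)
steps = unroll(toy_model, [1.0, 2.0], [1, 0])
```

A planner such as MCTS would call `predict` and `dynamics` in the same way, but would branch over many candidate actions at each step instead of following one fixed sequence.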

Nicholas's opinion: This is clearly a major step for model-based RL, becoming the state of the art on a very popular benchmark and enabling planning approaches to be used in domains with unknown rules or dynamics. I am typically optimistic about model-based approaches as progress towards safe AGI. They map well to how humans think about most complex tasks: we consider the likely outcomes of our actions and then plan accordingly. Additionally, model-based RL typically has the safety property that the programmers know what states the algorithm expects to pass through and end up in, which aids with interpretability and auditing. However, MuZero loses that property by using a learned model whose internal states are not constrained to have any semantic meaning. I would be quite excited to see follow-up work that enables us to understand what the model components are learning and how to audit them for particularly bad inaccuracies.

Rohin's opinion: Note: This is more speculative than usual. This approach seems really obvious and useful in hindsight (something I last felt for population-based training of hyperparameters). The main performance benefit (that I see) of model-based planning is that it only needs to use the environment interactions to learn how the environment works, rather than how to act optimally in the environment -- it can do the "act optimally" part using some MDP planning algorithm, or by simulating trajectories from the world model rather than requiring the actual environment. Intuitively, it should be significantly easier to learn how an environment works -- consider how easy it is for us to learn the rules of a game, as opposed to playing it well. However, most model-based approaches force the learned model to learn features that are useful for predicting the state, which may not be the ones that are useful for playing well, and this can handicap their final performance. Model-free approaches, on the other hand, learn exactly the features that are needed for playing well -- but they have a much harder learning task, so they need many more samples to learn, even though they can reach better final performance. Ideally, we would like to get the benefits of using an MDP planning algorithm, while still only requiring the agent to learn features that are useful for acting optimally.

This is exactly what MuZero does, similarly to this previous paper: its "model" only predicts actions, rewards, and value functions, all of which are much more clearly relevant to acting optimally. However, the tasks that are learned from environment interactions are in some sense "easier" -- the model only needs to predict, given a sequence of actions, what the immediate reward will be. It notably doesn't need to do a great job of predicting how an action now will affect things ten turns from now, as long as it can predict how things ten turns from now will be given the ten actions used to get there. Of course, the model does need to predict the policy and the value function (both hard and dependent on the future), but the learning signal for this comes from MCTS, whereas model-free RL relies on credit assignment for this purpose. Since MCTS can consider multiple possible future scenarios, while credit assignment only gets to see the trajectory that was actually rolled out, we should expect that MCTS leads to significantly better gradients and faster learning.

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA (Buck Shlegeris) (summarized by Rohin): Here are some beliefs that Buck reported that I think are particularly interesting (selected for relevance to AI safety):

1. He would probably not work on AI safety if he thought there was less than 30% chance of AGI within 50 years.

2. The ideas in Risks from Learned Optimization (AN #58) are extremely important.

3. If we build "business-as-usual ML", there will be inner alignment failures, which can't easily be fixed. In addition, the ML systems' goals may accidentally change as they self-improve, obviating any guarantees we had. The only way to solve this is to have a clearer picture of what we're doing when building these systems. (This was a response to a question about the motivation for MIRI's research agenda, and so may not reflect his actual beliefs, but just his beliefs about MIRI's beliefs.)

4. Different people who work on AI alignment have radically different pictures of what the development of AI will look like, what the alignment problem is, and what solutions might look like.

5. Skilled and experienced AI safety researchers seem to have a much more holistic and much more concrete mindset: they consider a solution to be composed of many parts that solve subproblems that can be put together with different relative strengths, as opposed to searching for a single overall story for everything.

6. External criticism seems relatively unimportant in AI safety, where there isn't an established research community that has already figured out what kinds of arguments are most important.

Rohin's opinion: I strongly agree with 2 and 4, weakly agree with 1, 5, and 6, and disagree with 3.

Technical AI alignment

Problems

Defining AI wireheading (Stuart Armstrong) (summarized by Rohin): This post points out that "wireheading" is a fuzzy category. Consider a weather-controlling AI tasked with increasing atmospheric pressure, as measured by the world's barometers. If it made a tiny dome around each barometer and increased air pressure within the domes, we would call it wireheading. However, if we increase the size of the domes until it's a dome around the entire Earth, then it starts sounding like a perfectly reasonable way to optimize the reward function. Somewhere in the middle, it must have become unclear whether or not it was wireheading. The post suggests that wireheading can be defined as a subset of specification gaming (AN #1), where the "gaming" happens by focusing on some narrow measurement channel, and the fuzziness comes from what counts as a "narrow measurement channel".

Rohin's opinion: You may have noticed that this newsletter doesn't talk about wireheading very much; this is one of the reasons why. It seems like wireheading is a fuzzy subset of specification gaming, and is not particularly likely to be the only kind of specification gaming that could lead to catastrophe. I'd be surprised if we found some sort of solution where we'd say "this solves all of wireheading, but it doesn't solve specification gaming" -- there don't seem to be particular distinguishing features that would allow us to have a solution to wireheading but not specification gaming. There can of course be solutions to particular kinds of wireheading that do have clear distinguishing features, such as reward tampering (AN #71), but I don't usually expect these to be the major sources of AI risk.

Technical agendas and prioritization

The Value Definition Problem (Sammy Martin) (summarized by Rohin): This post considers the Value Definition Problem: what should we make our AI system try to do (AN #33) to have the best chance of a positive outcome? It argues that an answer to the problem should be judged based on how much easier it makes alignment, how competent the AI system has to be to optimize it, and how good the outcome would be if it was optimized. Solutions also differ on how "direct" they are -- on one end, explicitly writing down a utility function would be very direct, while on the other, something like Coherent Extrapolated Volition would be very indirect: it delegates the task of figuring out what is good to the AI system itself.

Rohin's opinion: I fall more on the side of preferring indirect approaches, though by that I mean that we should delegate to future humans, as opposed to defining some particular value-finding mechanism into an AI system that eventually produces a definition of values.

Miscellaneous (Alignment)

Self-Fulfilling Prophecies Aren't Always About Self-Awareness (John Maxwell) (summarized by Rohin): Could we prevent a superintelligent oracle from making self-fulfilling prophecies by preventing it from modeling itself? This post presents three scenarios in which self-fulfilling prophecies would still occur. For example, if instead of modeling itself, it models the fact that there's some AI system whose predictions frequently come true, it may try to predict what that AI system would say, and then say that. This would lead to self-fulfilling prophecies.

Analysing: Dangerous messages from future UFAI via Oracles and Breaking Oracles: hyperrationality and acausal trade (Stuart Armstrong) (summarized by Rohin): These posts point out a problem with counterfactual oracles (AN #59): a future misaligned agential AI system could commit to helping the oracle (e.g. by giving it maximal reward, or making its predictions come true) even in the event of an erasure, as long as the oracle makes predictions that cause humans to build the agential AI system. Alternatively, multiple oracles could acausally cooperate with each other to build an agential AI system that will reward all oracles.

AI strategy and policy

AI Alignment Podcast: Machine Ethics and AI Governance (Lucas Perry and Wendell Wallach) (summarized by Rohin): Machine ethics has aimed to figure out how to embed ethical reasoning in automated systems of today. In contrast, AI alignment starts from an assumption of intelligence, and then asks how to make the system behave well. Wendell expects that we will have to go through stages of development where we figure out how to embed moral reasoning in less intelligent systems before we can solve AI alignment.

Generally in governance, there's a problem that technologies are easy to regulate early on, but that's when we don't know what regulations would be good. Governance has become harder now, because it has become very crowded: there are more than 53 lists of principles for artificial intelligence and lots of proposed regulations and laws. One potential mitigation would be governance coordinating committees: a sort of issues manager that keeps track of a field, maps the issues and gaps, and figures out how they could be addressed.

In the intermediate term, the worry is that AI systems are giving increasing power to those who want to manipulate human behavior. In addition, job loss is a real issue. One possibility is that we could tax corporations relative to how many workers they laid off and how many jobs they created.

Thinking about AGI, governments should probably not be involved now (besides perhaps funding some of the research), since we have so little clarity on what the problem is and what needs to be done. We do need people monitoring risks, but there’s a pretty robust existing community doing this, so government doesn't need to be involved.

Rohin's opinion: I disagree with Wendell that current machine ethics will be necessary for AI alignment -- that might be the case, but it seems like things change significantly once our AI systems are smart enough to actually understand our moral systems, so that we no longer need to design special procedures to embed ethical reasoning in the AI system.

It does seem useful to have coordination on governance, along the lines of governance coordinating committees; it seems a lot better if there's only one or two groups that we need to convince of the importance of an issue, rather than 53 (!!).

Other progress in AI

Reinforcement learning

Learning to Predict Without Looking Ahead: World Models Without Forward Prediction (C. Daniel Freeman et al) (summarized by Sudhanshu): One critique of the World Models (AN #23) paper was that in any realistic setting, you only want to learn the features that are important for the task under consideration, while the VAE used in the paper would learn features for state reconstruction. This paper instead studies world models that are trained directly from reward, rather than by supervised learning on observed future states, which should lead to models that only focus on task-relevant features. Specifically, they use observational dropout on the environment percepts, where the true state is passed to the policy with a peek probability p, while a neural network, M, generates a proxy state with probability 1 - p. At the next time-step, M takes the same input as the policy, plus the policy's action, and generates the next proxy state, which then may get passed to the controller, again with probability 1 - p.
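The observational dropout loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: `rollout_with_dropout`, `CounterEnv`, and the hand-written policy and world models are all hypothetical names, and nothing here is trained. It only shows how the input to the policy switches between the true observation (with probability p) and the proxy state produced by M (with probability 1 - p).

```python
import random


def rollout_with_dropout(env_step, world_model, policy, first_obs,
                         peek_prob, horizon, rng):
    """Run a policy that only sometimes gets to see the true observation."""
    seen = first_obs   # what the policy is shown at the current step
    peeks = 0          # how many times the true observation was revealed
    for _ in range(horizon):
        action = policy(seen)
        true_obs = env_step(action)   # the real environment always advances
        if rng.random() < peek_prob:
            seen = true_obs           # peek: pass the true state through
            peeks += 1
        else:
            # No peek: M gets the policy's input plus its action and
            # produces the proxy state the policy will see next.
            seen = world_model(seen, action)
    return seen, peeks


class CounterEnv:
    """Toy environment (assumption): the observation is a running total."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        self.position += action
        return self.position


def always_right(observation):
    return 1


def perfect_model(observation, action):
    return observation + action


def stuck_model(observation, action):
    return observation   # a bad model that ignores the action


always_peek = rollout_with_dropout(
    CounterEnv().step, stuck_model, always_right, 0, 1.0, 5, random.Random(0))
never_peek_good = rollout_with_dropout(
    CounterEnv().step, perfect_model, always_right, 0, 0.0, 5, random.Random(0))
never_peek_bad = rollout_with_dropout(
    CounterEnv().step, stuck_model, always_right, 0, 0.0, 5, random.Random(0))
```

With `peek_prob` at 1.0 the quality of the world model is irrelevant, while at 0.0 the policy's view of the world is entirely whatever the model produces, which is why low peek probabilities put pressure on M to be useful.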

They investigate whether the emergent 'world model' M behaves like a good forward predictive model. They find that even with a very low peek probability, e.g. p = 5%, M learns a world model good enough for the policy to perform reasonably well. Additionally, they find that world models learned this way can be used to train policies that sometimes transfer well to the real environment. They claim that the world model only learns features that are useful for task performance, but also note that interpretability of those features depends on inductive biases such as the network architecture.

Sudhanshu's opinion: This work warrants a visit for the easy-to-absorb animations and charts. On the other hand, they make a few innocent-sounding observations that made me uncomfortable because they were neither rigorously proved nor labelled as speculation, e.g. a) "At higher peek probabilities, the learned dynamics model is not needed to solve the task thus is never learned.", and b) "Here, the world model clearly only learns reliable transition maps for moving down and to the right, which is sufficient."

While this is a neat bit of work well presented, it is nevertheless still unlikely this (and most other current work in deep model-based RL) will scale to more complex alignment problems such as Embedded World-Models (AN #31); these world models do not capture the notion of an agent, and do not model the agent as an entity making long-horizon plans in the environment.

Deep learning

SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver (Po-Wei Wang et al) (summarized by Asya): Historically, deep learning architectures have struggled with problems that involve logical reasoning, since such problems often impose non-local constraints that gradient descent has a hard time learning. This paper presents a new technique, SATNet, which allows neural nets to solve logical reasoning problems by encoding them explicitly as MaxSAT-solving neural network layers. A MaxSAT problem provides a large set of logical constraints on an exponentially large set of options, and the goal is to find the option that satisfies as many logical constraints as possible. Since MaxSAT is NP-hard, the authors design a layer that solves a relaxation of the MaxSAT problem in its forward pass (that can be solved quickly, unlike MaxSAT), while the backward pass computes gradients as usual.
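For readers unfamiliar with MaxSAT, the following sketch solves a tiny instance by exhaustive search. It is meant only to illustrate the problem definition: it is not the relaxation that SATNet uses, and it takes time exponential in the number of variables, which is exactly why a relaxation is needed. The function names and the clause encoding (signed integers, where `2` means "variable 2 is true" and `-2` means "variable 2 is false") are choices made for this example.

```python
from itertools import product


def count_satisfied(clauses, assignment):
    """Count the clauses that have at least one true literal."""
    return sum(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def brute_force_maxsat(clauses, num_vars):
    """Try every one of the 2**num_vars assignments and keep the best one."""
    best = max(
        product([False, True], repeat=num_vars),
        key=lambda assignment: count_satisfied(clauses, assignment),
    )
    return best, count_satisfied(clauses, best)


# Four constraints over two variables; they cannot all hold at once,
# since the first two contradict each other.
clauses = [[1], [-1], [1, 2], [-2]]
best_assignment, num_satisfied = brute_force_maxsat(clauses, num_vars=2)
```

Here the best option sets variable 1 to true and variable 2 to false, satisfying three of the four clauses.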

In one experiment, SATNet is given bit representations of 9,000 9 x 9 Sudoku boards which it uses to learn the logical constraints of Sudoku, and is then presented with 1,000 test boards to solve. SATNet vastly outperforms traditional convolutional neural networks given the same training / test setup, achieving 98.3% test accuracy where the convolutional net achieves 0%. It performs similarly well on a "Visual" Sudoku problem where the trained network consists of initial layers that perform digit recognition followed by SATNet layers, achieving 63.2% accuracy where the convolutional net achieves 0.1%.

Asya's opinion: My impression is this is a big step forward in being able to embed logical reasoning in current deep learning techniques. From an engineering perspective, it seems extremely useful to be able to train systems that incorporate these layers end-to-end. It's worth being clear that in systems like these, a lot of generality is lost since part of the network is explicitly carved out for solving a particular problem of logical constraints -- it would be hard to use the same network to learn a different problem.

News