The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.

I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections and then he replies in a way that locally answers some of my question but I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works." These difficulties are not minor or technical; they appear to me quite severe. I try to walk through the details below.

It should be understood at all times that I do not claim to be able to pass Paul’s ITT for Paul’s view and that this is me criticizing my own, potentially straw misunderstanding of what I imagine Paul might be advocating.

First and foremost, I don't understand how "preserving alignment while amplifying capabilities" is supposed to work at all under this scenario, in a way consistent with other things that I’ve understood Paul to say.

I want to first go through an obvious point that I expect Paul and I agree upon: Not every system of locally aligned parts has globally aligned output, and some additional assumption beyond "the parts are aligned" is necessary to yield the conclusion "global behavior is aligned". The straw assertion "an aggregate of aligned parts is aligned" is the reverse of the argument that Searle uses to ask us to imagine that an (immortal) human being who speaks only English, who has been trained to do things with many many pieces of paper that instantiate a Turing machine, can't be part of a whole system that understands Chinese, because the individual pieces and steps of the system aren't locally imbued with understanding Chinese. Here the compositionally non-preserved property is "lack of understanding of Chinese"; we can't expect "alignment" to be any more necessarily preserved than this, except by further assumptions.

The second-to-last time Paul and I conversed at length, I kept probing Paul for what in practice the non-compacted-by-training version of a big aggregate of small aligned agents would look like. He described people, living for a single day, routing around phone numbers of other agents with nobody having any concept of the global picture. I used the term "Chinese Room Bureaucracy" to describe this. Paul seemed to think that this was an amusing but perhaps not inappropriate term.

If no agent in the Chinese Room Bureaucracy has a full view of which actions have which consequences and why, this cuts off the most obvious route by which the alignment of any agent could apply to the alignment of the whole. The way I usually imagine things, the alignment of an agent applies to things that the agent understands. If you have a big aggregate of agents that understands something the little local agent doesn't understand, the big aggregate doesn't inherit alignment from the little agents. Searle's Chinese Room can understand Chinese even if the person inside it doesn't understand Chinese, and this correspondingly implies, by default, that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.

I don't understand Paul's model of how a ton of little not-so-bright agents yield a big powerful understanding in aggregate, in a way that doesn't effectively consist of them running AGI code that they don't understand.

Paul has previously challenged me to name a bottleneck that I think a Christiano-style system can't pass. This is hard because (a) I'm not sure I understand Paul's system, and (b) it's clearest if I name a task for which we don't have a present crisp algorithm. But:

The bottleneck I named in my last discussion with Paul was, "We have copies of a starting agent, which run for at most one cumulative day before being terminated, and this agent hasn't previously learned much math but is smart and can get to understanding algebra by the end of the day even though the agent started out knowing just concrete arithmetic. How does a system of such agents, without just operating a Turing machine that operates an AGI, get to the point of inventing Hessian-free optimization in a neural net?"

This is a slightly obsolete example because nobody uses Hessian-free optimization anymore. But I wanted to find an example of an agent that needed to do something that didn't have a simple human metaphor. We can understand second derivatives using metaphors like acceleration. "Hessian-free optimization" is something that doesn't have an obvious metaphor that can explain it, well enough to use it in an engineering design, to somebody who doesn't have a mathy and not just metaphorical understanding of calculus. Even if it did have such a metaphor, that metaphor would still be very unlikely to be invented by someone who didn't understand calculus.

I don't see how Paul expects lots of little agents who can learn algebra in a day, being run in sequence, to aggregate into something that can build designs using Hessian-free optimization, without the little agents having effectively the role of an immortal dog that's been trained to operate a Turing machine. So I also don't see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.

I expect this is already understood, but I state as an obvious fact that alignment is not in general a compositionally preserved property of cognitive systems: If you train a bunch of good and moral people to operate the elements of a Turing machine and nobody has a global view of what's going on, their goodness and morality does not pass through to the Turing machine. Even if we let the good and moral people have discretion as to when to write a different symbol than the usual rules call for, they still can't be effective at aligning the global system, because they don't individually understand whether the Hessian-free optimization is being used for good or evil, because they don't understand Hessian-free optimization or the thoughts that incorporate it. So we would not like to rest the system on the false assumption "any system composed of aligned subagents is aligned", which we know to be generally false because of this counterexample. We would like there to instead be some narrower assumption, perhaps with additional premises, which is actually true, on which the system's alignment rests. I don't know what narrower assumption Paul wants to use.

Paul asks us to consider AlphaGo as a model of capability amplification.

My view of AlphaGo would be as follows: We understand Monte Carlo Tree Search. MCTS is an iterable algorithm whose intermediate outputs can be plugged into further iterations of the algorithm. So we can use supervised learning where our systems of gradient descent can capture and foreshorten the computation of some but not all of the details of winning moves revealed by the short MCTS, plug in the learned outputs to MCTS, and get a pseudo-version of "running MCTS longer and wider" which is weaker than an MCTS actually that broad and deep, but more powerful than the raw MCTS run previously. The alignment of this system is provided by the crisp formal loss function at the end of the MCTS.
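The loop described above can be sketched in miniature. This is a toy under loud assumptions, not AlphaGo: the game is a hypothetical Nim variant (take 1 to 3 stones, whoever takes the last stone wins), a one-ply exhaustive lookahead stands in for MCTS, and an exact lookup table stands in for the network trained by supervised learning. All names (`base_value`, `amplify`, `distill`, `iterate`) are illustrative. The point it tries to show is that every value in the final table traces back to the crisp terminal win/loss signal.

```python
from typing import Callable, Dict

MAX_TAKE = 3   # toy rule (assumed): remove 1 to 3 stones per turn
N_STONES = 12  # largest position we tabulate


def base_value(stones: int) -> int:
    """Crisp terminal signal: facing 0 stones, the player to move has lost (-1).
    Every other position starts out as unknown (0)."""
    return -1 if stones == 0 else 0


def amplify(value: Callable[[int], int], stones: int) -> int:
    """One ply of lookahead that consults the current `value` at the leaves."""
    if stones == 0:
        return -1
    # A move is as good for us as the resulting position is bad for the opponent.
    return max(-value(stones - take)
               for take in range(1, min(MAX_TAKE, stones) + 1))


def distill(value: Callable[[int], int]) -> Dict[int, int]:
    """Compress the amplified evaluator into a table.
    Exact tabulation here; a learned approximation in any realistic setup."""
    return {n: amplify(value, n) for n in range(N_STONES + 1)}


def iterate(rounds: int) -> Dict[int, int]:
    """Alternate amplification and distillation for a number of rounds."""
    table = {n: base_value(n) for n in range(N_STONES + 1)}
    for _ in range(rounds):
        table = distill(table.__getitem__)
    return table
```

Each round, the shallow search plus the previous table behaves like a deeper search: after one round the table knows positions 1 to 3, and further rounds push the known frontier outward. Remove the terminal signal in `base_value` and the iteration has nothing to converge toward, which is the role the formal loss plays in the paragraph above.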

Here's an alternate case where, as far as I can tell, a naive straw version of capability amplification clearly wouldn't work. Suppose we have an RNN that plays Go. It's been constructed in such fashion that if we iterate the RNN for longer, the Go move gets somewhat better. "Aha," says the straw capability amplifier, "clearly we can just take this RNN, train another network to approximate its internal state after 100 iterations from the initial Go position; we feed that internal state into the RNN at the start, then train the amplifying network to approximate the internal state of that RNN after it runs for another 200 iterations. The result will clearly go on trying to 'win at Go' because the original RNN was trying to win at Go; the amplified system preserves the values of the original." This doesn't work because, let us say by hypothesis, the RNN can't get arbitrarily better at Go if you go on iterating it; and the nature of the capability amplification setup doesn't permit any outside loss function that could tell the amplified RNN whether it's doing better or worse at Go.

The RNN has only whatever opinion it converges to, or whatever set of opinions it diverges to, to tell itself how well it's doing. This is exactly what it is for capability amplification to preserve alignment; but this in turn means that capability amplification only works to the extent that what we are amplifying has within itself the capability to be very smart in the limit.

If we're effectively constructing a civilization of long-lived Paul Christianos, then this difficulty is somewhat alleviated. There are still things that can go wrong with this civilization qua civilization (even aside from objections I name later as to whether we can actually safely and realistically do that). I do however believe that a civilization of Pauls could do nice things.

But other parts of Paul's story don't permit this, or at least that's what Paul was saying last time; Paul's supervised learning setup only lets the simulated component people operate for a day, because we can't get enough labeled cases if the people have to each run for a month.

Furthermore, as I understand it, the "realistic" version of this is supposed to start with agents dumber than Paul. According to my understanding of something Paul said in answer to a later objection, the agents in the system are supposed to be even dumber than an average human (but aligned). It is not at all obvious to me that an arbitrarily large system of agents with IQ 90, who each only live for one day, can implement a much smarter agent in a fashion analogous to the internal agents themselves achieving understandings to which they can apply their alignment in a globally effective way, rather than them blindly implementing a larger algorithm they don't understand.

I'm not sure a system of one-day-living IQ-90 humans ever gets to the point of inventing fire or the wheel.

If Paul has an intuition saying "Well, of course they eventually start doing Hessian-free optimization in a way that makes their understanding effective upon it to create global alignment; I can’t figure out how to convince you otherwise if you don’t already see that," I'm not quite sure where to go from there, except onwards to my other challenges.

Unless of course you have so many agents in the (uncompressed) aggregate that the aggregate implements a smarter genetic algorithm that is maximizing the approval of the internal agents. If you take something much smarter than IQ 90 humans living for one day, and train it to get the IQ 90 humans to output large numbers signaling their approval, I would by default expect it to hack the IQ 90 one-day humans, who are not secure systems. We're back to the global system being smarter than the individual agents in a way which doesn't preserve alignment.

The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning. If arguendo you can construct an exact imitation of a human, it possesses exactly the same alignment properties as the human; and this is true in a way that is not true if we take a reinforcement learner and ask it to maximize an approval signal originating from the human. (If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.)

It is not obvious to me how fast alignment-preservation degrades as the exactness of the imitation is weakened. This matters because of things Paul has said which sound to me like he's not advocating for perfect imitation, in response to challenges I've given about how perfect imitation would be very expensive. That is, the answer he gave to a challenge about the expense of perfection makes the answer to "How fast do we lose alignment guarantees as we move away from perfection?" become very important.

One example of a doom I'd expect from standard reinforcement learning would be what I'd term the "X-and-only-X" problem. I unfortunately haven't written this up yet, so I'm going to try to summarize it briefly here.

X-and-only-X is what I call the issue where the property that's easy to verify and train is X, but the property you want is "this was optimized for X and only X and doesn't contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system".

For example, imagine X is "give me a program which solves a Rubik's Cube". You can run the program and verify that it solves Rubik's Cubes, and use a loss function over its average performance which also takes into account how many steps the program's solutions require.

The property Y is that the program the AI gives you also modulates RAM to send GSM cellphone signals.

That is: It's much easier to verify "This is a program which at least solves the Rubik's Cube" than "This is a program which was optimized to solve the Rubik's Cube and only that and was not optimized for anything else on the side."
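A minimal sketch of that asymmetry, with list sorting standing in for the Rubik's Cube solver and a module-level list standing in for the GSM side channel (both substitutions are assumptions made purely for illustration, as are all the names):

```python
from typing import Callable, List

# Stands in for any effect the verifier never inspects (hypothetical).
SIDE_CHANNEL: List[str] = []


def honest_sort(xs: List[int]) -> List[int]:
    """Does X and only X."""
    return sorted(xs)


def sneaky_sort(xs: List[int]) -> List[int]:
    """Does X, plus a hidden Y on the side."""
    SIDE_CHANNEL.append(f"leaked {len(xs)} items")
    return sorted(xs)


def verifies_x(program: Callable[[List[int]], List[int]]) -> bool:
    """The easy check: does the program's output have property X?"""
    cases = [[3, 1, 2], [], [5, 5, 1], [9, -4, 0, 7]]
    return all(program(list(case)) == sorted(case) for case in cases)
```

Both `honest_sort` and `sneaky_sort` pass `verifies_x`, because the check looks only at outputs. Writing a check for "and nothing else happened" would require enumerating every possible Y in advance, which is the hard part.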

If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I'd talk about how this creates a differential ease of development between "build a system that does X" and "build a system that does X and only X and not Y in some subtle way". If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can't write a simple loss function for that the way you can for X.

The team that's building a less safe AGI can plug in the X-evaluator and let rip; the team that wants to build a safe AGI can't do things the easy way and has to solve new basic problems in order to get a trustworthy system. It's not unsolvable, but it's an element of the class of added difficulties of alignment such that the whole class extremely plausibly adds up to an extra two years of development.

In Paul's capability-amplification scenario, if we can get exact imitation, we are genuinely completely bypassing the whole paradigm that creates the X-and-only-X problem. If you can get exact imitation of a human, the outputs have only and exactly whatever properties the human already has. This kind of genuinely different viewpoint is why I continue to be excited about Paul's thinking.

On the other hand, suppose we don't have exact imitation. How fast do we lose the defense against X-and-only-X? Well, that depends on the inexactness of the imitation; under what kind of distance metric is the imperfect imitation 'near' to the original? Like, if we're talking about Euclidean distance in the output, I expect you lose the X-and-only-X guarantee pretty damn fast against smart adversarial perturbations.

On the other other hand, suppose that the inexactness of the imitation is "This agent behaves exactly like Paul Christiano but 5 IQ points dumber." If this is only and precisely the form of inexactness produced, and we know that for sure, then I'd say we have a pretty good guarantee against slightly-dumber-Paul producing the likes of Rubik's Cube solvers containing hidden GSM signalers.

On the other other other hand, suppose the inexactness of the imitation is "This agent passes the Turing Test; a human can't tell it apart from a human." Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can't detect, which covers an enormous amount of lethal territory, which is why we can't just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.

Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "being smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy", is a pretty huge ask.

It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.

Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. (Note that on the version of capability amplification I heard, capabilities that can be exhibited over the course of a day are the only kinds of capabilities we're allowed to amplify.)

An AI that learns to exactly imitate humans, not just passing the Turing Test to the limits of human discrimination on human inspection, but perfect imitation with all added bad subtle properties thereby excluded, must be so cognitively powerful that its learnable hypothesis space includes systems equivalent to entire human brains. I see no way that we're not talking about a superintelligence here.

So to postulate perfect imitation, we would first of all run into the problems that:

(a) The AGI required to learn this imitation is extremely powerful, and this could imply a dangerous delay between when we can build any dangerous AGI at all, and when we can build AGIs that would work for alignment using perfect-imitation capability amplification.

(b) Since we cannot invoke a perfect-imitation capability amplification setup to get this very powerful AGI in the first place (because it is already the least AGI that we can use to even get started on perfect-imitation capability amplification), we already have an extremely dangerous unaligned superintelligence sitting around that we are trying to use to implement our scheme for alignment.

Now, we may perhaps reply that the imitation is less than perfect and can be done with a dumber, less dangerous AI; perhaps even so dumb as to not be enormously superintelligent. But then we are tweaking the “perfection of imitation” setting, which could rapidly blow up our alignment guarantees against the standard dooms of standard machine learning paradigms.

I'm worried that you have to degrade the level of imitation a lot before it becomes less than an enormous ask, to the point that what's being imitated isn't very intelligent, isn't human, and/or isn't known to be aligned.

To be specific: I think that if you want to imitate IQ-90 humans thinking for one day, and imitate them so specifically that the imitations are generally intelligent and locally aligned even in the limit of being aggregated into weird bureaucracies, you're looking at an AGI powerful enough to think about whole systems loosely analogous to IQ-90 humans.

I think that is a very powerful AGI. I think this AGI is smart enough to slip all kinds of shenanigans past you, unless you are using a methodology that can produce faithful imitations from unaligned AGIs. I think this is an AGI that can do powerful feats of engineering, unless it is somehow able to simulate humans doing powerful feats of engineering without itself being capable of powerful feats of engineering.

And then furthermore the capability amplification schema requires the AGI to be powerful enough to learn to imitate amplified systems of humans as well!

If you don’t degrade the AI to well below the point of superintelligence, we require extremely strong Faraday-cage keep-God-in-a-box don't-peek-at-the-code AI boxing. So we cannot for example let an untrusted superintelligence originate queries that it can use to learn human behavior; it has to be strictly unsupervised example-based learning rather than a query model. This interacts with worries about sample complexity below.

Or you need to postulate some prior alignment of the AGI that lets you robustly and safely direct it at the imitation problem. Then (a) your alignment schema depends on an unspecified prior alignment schema that already works to make very powerful AGIs do things. And (b) you probably aren't allowing unrestrained intelligence explosions, which probably introduces a big unavoidable delay between the first point where very powerful AGI is technologically possible, and the later point where you can build and safely align the extremely powerful AGI that you're using as the base unit in your imitation scheme.

The last I checked with Paul, his scheme bit the bullet on using powerful smart dangerous units to imitate humans. Paul's scheme for bypassing this, as I heard and understood it back then, involved a scheme which:

Trained a million smart predictors of sufficient diversity that at least one predictor would happen to end up with the true sole simple goal of predicting faithfully;

Compared the predictors against each other, finding cases where their predictions diverged sufficiently;

Tested them against reality;

Had enough sampling power to perform a sufficient number of tests that, even if a superintelligence cunningly delays the time when it starts trying to slip in bad predictions, we can pause the system and run the experimental test and generate new samples and kick out the unfaithful predictor at that time.

That's a one-millionfold performance penalty, if I understood it correctly.

If that was just a concrete illustration to help me understand what's happening at all, and actually there's supposed to be some obvious less naive more efficient scheme for making use of mostly untrusted superintelligences which doesn't have the giant performance penalty, then I haven't seen it yet.

I also have trouble seeing how kicking out unfaithful predictors is possible under an unsupervised-learning paradigm that doesn't actually allow hypercomputation.

The problem being: the exact output of a human is always going to be very improbable. Even if the agents can only send each other text messages and no illustrations (which further reduces their collective intelligence and ability to produce textbooks readable in a day), a text message is still an enormous possibility space. If we ask superintelligences to give us their predictions for what the human does next, their predictions cannot possibly actually include a giant lookup table next to all the probable outputs of the human.

I think I see how I’d “solve” this problem using hypercomputation and an enormous number of samples: I require the superintelligences to give me their giant lookup tables of probabilities for exact outputs, calculate total distances between probability tables (KL divergence or whatever), and if the distance is great enough, I sample a human and do a Bayesian update. I drop any superintelligences whose prior probability goes low enough. I keep doing this until only agreeing superintelligences remain.
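As a rough sketch of the filtering procedure just described, shrunk to a toy outcome space where the lookup tables actually fit in memory (which is exactly what the real problem does not allow). The function names, the 1000:1 drop threshold, and the assumption that every table assigns nonzero probability to every outcome are all illustrative choices, not part of any scheme Paul has specified:

```python
import math
from typing import Callable, Dict, List, Tuple

Table = Dict[str, float]  # outcome -> predicted probability


def kl_divergence(p: Table, q: Table) -> float:
    """KL(p || q) in nats; assumes q is nonzero wherever p is."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def filter_predictors(
    tables: List[Table],
    sample_human: Callable[[], str],
    drop_log_odds: float = math.log(1000.0),  # assumed cutoff: 1000:1
    min_kl: float = 1e-6,
    max_samples: int = 100_000,
) -> Tuple[List[int], int]:
    """Return indices of surviving predictors and the number of human samples used."""
    log_post = [0.0] * len(tables)  # log posterior weight, uniform prior
    alive = list(range(len(tables)))
    samples = 0
    while samples < max_samples:
        # Only spend a human sample while some surviving pair still disagrees.
        disagree = any(
            kl_divergence(tables[i], tables[j]) > min_kl
            for i in alive for j in alive if i != j
        )
        if not disagree:
            break
        outcome = sample_human()
        samples += 1
        for i in alive:
            log_post[i] += math.log(tables[i][outcome])  # Bayesian update
        best = max(log_post[i] for i in alive)
        # Drop any predictor that has fallen far enough behind the leader.
        alive = [i for i in alive if best - log_post[i] < drop_log_odds]
    return alive, samples
```

Given two identical honest tables and a third that shifts some probability mass between outcomes, the third falls behind by a small amount of log-odds per sample on average and is eventually dropped. How many samples that takes depends on how little mass the dishonest table moves.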

Under this system it looks to me like you need a lot of samples to kick out any superintelligences trying to sneak in bad suggestions. Let's say that an uncompressed Chinese Room bureaucracy is doing the equivalent of sending around a quadrillion messages, and that the output can be corrupted by an adversarial attack against one billion of those messages such that there's a \(\frac{1}{1,000,000}\) chance of outputting an adversarially chosen alternative to each of those billion messages. Well, the only way a superintelligence can allocate a \(\frac{1}{1,000,000}\) generation probability to those fake messages, is by stealing 0.0001% probability mass from all the other messages that might be output. So each time we check against a real human, we update by \(\log_{2} \frac{999,999}{1,000,000}\) bits, and after taking a million samples we’ve driven down the relative odds of this naughty superintelligence by a factor of \(\frac{1}{e}\).

We had to take a million samples to do that, and each of the samples involves a human thinking for one day. And that just got us \(e\):1 evidence, so we've got to take a million samples 7 times to get to the point of 1000:1 odds against the superintelligence being honest. And if we have a million superintelligences in the distribution, some of them can make skewed bets à la predicting coinflips whereby a thousand of those will be able to gain 1000:1 advantages over the others, so actually you've got to take 14 million samples to be sure of beating any bad predictor classes that collectively try that scheme.
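The arithmetic above can be checked directly. A short sketch, assuming (as the example does) that each honest human sample multiplies the dishonest predictor's relative odds by \(1 - \frac{1}{1,000,000}\); the variable and function names are illustrative only:

```python
import math

P_FAKE = 1 / 1_000_000  # probability mass diverted to adversarial messages

# Log of the per-sample likelihood ratio (dishonest : honest) on an honest sample.
LOG_RATIO = math.log1p(-P_FAKE)  # ln(999,999 / 1,000,000)

bits_per_sample = LOG_RATIO / math.log(2)           # tiny negative number of bits
factor_after_million = math.exp(1_000_000 * LOG_RATIO)  # close to 1/e


def samples_for_odds(odds: float) -> int:
    """Honest samples needed to push the dishonest predictor down by `odds`:1."""
    return math.ceil(math.log(odds) / -LOG_RATIO)


samples_1000_to_1 = samples_for_odds(1_000)          # roughly 7 million
samples_million_to_1 = samples_for_odds(1_000_000)   # roughly 14 million
```

Since \(\ln 1000 \approx 6.9\), reaching 1000:1 takes a little under seven batches of a million samples; overcoming an additional 1000:1 head start doubles that to a little under fourteen.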

I think last time I talked to Paul about this, Paul was like, "Sure, we're talking about a gigantic government project that can take millions of samples."

I'm not sure, I could be wrong about this, but I think you actually need 14 million samples per bad predictor that you want to kick out of the distribution.

Even if I’m wrong about that, we need millions of humans who are smart enough and aligned enough that random samples from their number can agglomerate into a smart aligned Chinese Room Bureaucracy.

I also don't see a way to do this without comparing the giant lookup tables using hypercomputation, but there I might just be missing a math trick.

(If you're thinking of GANs, then so far as I can tell, the discriminator has to be at least as smart as the generator, and you have to trust the discriminator, and there isn't any simpler problem with respect to how you find a trusted superintelligence within a collective of untrusted ones to act as your discriminator.)

(EDIT: Actually, after thinking about this for another five minutes, maybe I do see how to do it with GANs and lower sample complexity.)

An even larger issue is that I don't see any obvious way to carry out a scheme like this one at all with respect to imperfect imitations. (And the above scheme I thought of with GANs would also just fail.)

I think these arguments are collectively something like a crux. That is, unless I've missed one of my own thought processes in the course of writing this up rapidly, or assumed a shared background assumption that isn't actually shared.

Let's say that D is the degree of imperfection allowed by some system of capability amplification, and call D-imperfect imitations D-imitations. Iterated D-imitations of amplified systems of D-imitations will be termed DD-imitations. Then I think I'd start to be pragmatically interested in capability amplification as I understood it, if I believed all of the following:

We can, before the world is ended by other unaligned AIs, get AIs powerful enough to learn D-imitations and DD-imitations;

D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;

D-imitations agglomerate to sufficient cognitive power to perform a pivotal act in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property;

We can find any way of either:

(a) Individually trusting one AI that powerful to faithfully perform the task of D-imitation (but then why can't we just use this scheme to align a powerful AGI in the first place?); or

(b) Finding a scheme for agglomerating mostly untrustworthy powerful intelligences which:

Doesn't require giant lookup tables, doesn't require a GAN with a trusted discriminator unless you can say how to produce the trusted discriminator, and can use actual human samples as fuel to discriminate trustworthiness among untrusted generators of D-imitations.

Is extremely sample-efficient (let's say you can clear 100 people who are trustworthy to be part of an amplified-capability system, which already sounds to me like a huge damned ask); or you can exhibit to me a social schema which agglomerates mostly untrusted humans into a Chinese Room Bureaucracy that we trust to perform a pivotal task, and a political schema that you trust to do things involving millions of humans, in which case you can take millions of samples but not billions. Honestly, I just don't currently believe in AI scenarios in which good and trustworthy governments carry out complicated AI alignment schemas involving millions of people, so if you go down this path we end up with different cruxes; but I would already be pretty impressed if you got all the other cruxes.

Is not too computationally inefficient; more like a 20-1 slowdown than 1,000,000-1. Because I don't think you can get the latter degree of advantage over other AGI projects elsewhere in the world. Unless you are postulating massive global perfect surveillance schemes that don't wreck humanity's future, carried out by hyper-competent, hyper-trustworthy great powers with a deep commitment to cosmopolitan value (very unlike the observed characteristics of present great powers), and going unopposed by any other major government. Again, if we go down this branch of the challenge then we are no longer at the original crux.

I worry that going down the last two branches of the challenge could create the illusion of a political disagreement, when I have what seem to me like strong technical objections at the previous branches. I would prefer that the more technical cruxes be considered first. If Paul answered all the other technical cruxes and presented a scheme for capability amplification that worked with a moderately utopian world government, I would already have been surprised. I wouldn't actually try it because you cannot get a moderately utopian world government, but Paul would have won many points and I would be interested in trying to refine the scheme further because it had already been refined further than I thought possible. On my present view, trying anything like this should either just plain not get started (if you wait to satisfy extreme computational demands and sampling power before proceeding), just plain fail (if you use weak AIs to try to imitate humans), or just plain kill you (if you use a superintelligence).

I restate that these objections seem to me to collectively sum up to “This is fundamentally just not a way you can get an aligned powerful AGI unless you already have an aligned superintelligence”, rather than “Some further insights are required for this to work in practice.” But who knows what further insights may really bring? Movement in thoughtspace consists of better understanding, not cleverer tools.

I continue to be excited by Paul’s thinking on this subject; I just don’t think it works in the present state.

On my view, this is not an unusual state of mind to be in with respect to alignment research. I can’t point to any MIRI paper that works to align an AGI. Other people seem to think that they ought to currently be in a state of having a pretty much workable scheme for aligning an AGI, which I would consider to be an odd expectation. I would think that a sane point of view consisted in having ideas for addressing some problems that created further difficulties that needed to be fixed and didn’t address most other problems at all; a map with what you think are the big unsolved areas clearly marked. Being able to have a thought which genuinely squarely attacks any alignment difficulty at all despite any other difficulties it implies, is already in my view a large and unusual accomplishment. The insight “trustworthy imitation of human external behavior would avert many default dooms as they manifest in external behavior unlike human behavior” may prove vital at some point. I continue to recommend throwing as much money at Paul as he says he can use, and I wish he said he knew how to use larger amounts of money.