I believe the current Christiano-style blackbox preference-learning approach (Christiano et al 2017/Ziegler et al 2019) could be improved to make it more compute-efficient, sample-efficient, and simpler. Two changes seem particularly relevant for music/text generation: replacing blackbox RL training of the agent with gradient ascent directly on the differentiable reward model, and training any generator that is still needed by cheap imitation learning on the optimized outputs.

The new architecture & training would go like this when combined:

Here I propose changing the architecture to explicitly optimize the reward model’s utility/reward score: remove the agent/generator entirely and instead improve candidate sequences by gradient ascent on the (differentiable) reward model. There is no need to build a redundant agent model when the reward model is differentiable and can directly specify how an input sequence ought to change to improve it. This simplifies the overall architecture greatly, avoids the expensive, unstable & complex blackbox training of DRL agents, and enables easy generation of sequences which are both high-scoring & highly diverse (thus informative) for an oracle to rate; the ratings can then be fed back into the reward model for further training. To the extent that an agent/generator is still necessary to generate many sequences cheaply, it can be trained quickly & stably by imitation learning on a corpus of datapoints optimized by gradient ascent on the reward model.

While running PPO against the reward model, I concluded that, compared to other approaches I’ve seen for optimizing the outputs of a NN, the blackbox preference-learning approach has 2 major flaws:

1. Compute-Inefficient: it is slow and memory-hungry (I have to use GPT-2-117M to fit reasonable minibatches onto 2×1080ti, and even then iterations can take days)
2. Single-Divergence-Prone: it ‘mode collapses’ into adversarial samples, typically highly repetitive, and typically eventually into just one adversarial sample

Slow feedback: 1 day, 1 counter-example. Iteration is slow because of the double whammy: each run takes days before showing any score improvement or divergence, and when it does diverge, it typically yields only a handful of usable adversarial datapoints to rate & retrain on. Hence the frustrating experience of watching each run end in just one adversarial sample, which may be only somewhat different from the previous run’s.

Mode collapse. Thinking about this, a blackbox RL approach doesn’t seem quite right. For an RL problem, it’s fine to find only a single path which leads to a high reward. To put it in GAN terms, this ‘mode collapse’ onto a single adversarial example is, as far as the agent/Generator is concerned, a 100% valid solution. The ‘environment’ has no memory, and cannot penalize the agent/Generator for repetition. If there exists any string “XYZ” which, on its own or appended to any other string, causes the reward model/Discriminator to always output the maximal reward, then why does the agent/Generator need to learn anything else? It won. But that’s not really what we want: we want it to sample from the full distribution of high-quality sequences. Unfortunately, mode collapse remains unsolved in GANs, and I can’t think of any easy way to fix it in this preference-learning setting either.

Ask reward model how to edit samples. One approach avoiding both those issues is to drop the blackbox optimizer approach entirely—which incentivizes wasting a ton of compute to find a single adversarial example—and instead optimize datapoints directly. It seems like a waste to go to all this effort to build a differentiable surrogate (reward) model of the environment (the human), and then treat it like just another blackbox. But it’s not: that’s the whole point of preference learning! Since GPT-2 is differentiable, it ought to be possible to backprop through it to do planning and optimization, like MPC (model-predictive control). Typically, we hold the inputs and outputs fixed and use backprop to adjust the model, but one can instead hold the model fixed and adjust the inputs based on backprop to produce a desired output: in this case, hold a GPT-2 reward model fixed, and adjust textual inputs to make the output, the reward, larger, by backpropping from the output back through the model to the current input. This approach works well for editing images in GANs, and ‘optimizing images’ to maximize a CNN’s probability of classifying them as a ‘dog’ or ‘cat’ etc. has long been done as a way of visualizing what a CNN has learned. For example, one could generate high-scoring music pieces by generating a random sequence, embedding it into the reward model’s input vector space, and then doing gradient ascent on that embedding. (No PPO cluster required.) This is equivalent to planning/revising: at each iteration, GPT-2 ‘considers’ the sequence as a whole and can make global changes rather than local changes to the final entry in the sequence; over many iterations, it can ‘edit’ a sequence repeatedly, rather than being forced to generate the entire sequence in a single shot as PPO is. This could be a lot faster, since it exploits the whitebox nature of a learned reward model instead of treating it as a high-variance blackbox.
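To make this concrete, here is a minimal PyTorch sketch, assuming a hypothetical `reward_model` (a GPT-2-style network with a scalar reward head which, like HuggingFace models, accepts `inputs_embeds`) and its matching `tokenizer`; the names and the final nearest-token projection are illustrative assumptions, not a definitive implementation:

```python
import torch

def ascend(reward_model, tokenizer, init_text, steps=200, lr=0.1):
    """Gradient-ascend a sequence's embeddings to maximize its reward score."""
    reward_model.eval()
    embed = reward_model.get_input_embeddings()           # token-id -> vector table
    ids = tokenizer(init_text, return_tensors="pt").input_ids
    x = embed(ids).detach().clone().requires_grad_(True)  # continuous relaxation of the text
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        reward = reward_model(inputs_embeds=x).mean()     # scalar reward of the whole sequence
        (-reward).backward()                              # ascent = descent on the negated reward
        opt.step()                                        # edits the *input*, not the weights
    # Project the optimized embeddings back onto the nearest discrete tokens:
    with torch.no_grad():
        nearest = torch.cdist(x, embed.weight.unsqueeze(0)).argmin(-1)
    return tokenizer.decode(nearest[0].tolist())
```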

Example: PPLM. A similar approach to optimizing GPT-2 outputs has since been published by Uber as “PPLM” (Dathathri et al 2019; blog). PPLM uses gradients from both GPT-2 and a small control NN to do simultaneous gradient ascent on the generation, maximizing both likelihoods at once: the text stays fluent English (due to GPT-2) while still maximizing the parallel target (such as a ‘positivity’ goal).

Another possibility would be beam search (although it has produced bad results in NNs, as discussed in the nucleus sampling paper, perhaps because log-likelihood training encourages repetition), or the expert iteration/MCTS training from AlphaGo Zero. MCTS was originally introduced for planning in general MDPs and isn’t inherently limited to two-player games; the “rules” of generating sequence data are trivial (anything ASCII, in this case), and the discriminator provides a well-defined reward. So instead of a NN which directly generates the next character, it could instead (given a particular prefix/history) output values for the 128 ASCII values, run MCTS search for a while, produce a refined value for each character, and retrain the NN towards the refined values; every minibatch of the generator, one generates a bunch of examples for the human to judge, providing a new minibatch for the discriminator. Hence, tree-iteration learning-from-preferences deep RL. With music we don’t necessarily need the stable self-play that tree iteration provides, since it’s not clear conceptually what self-play would even mean here (quality is inherently a human-defined criterion, as opposed to Go, where the win condition is external and human preferences are not the criteria); but given the AlphaZero & Anthony et al Hex results, this could be considerably more compute-efficient, by providing much more supervision at each timestep instead of just a little bit of supervision from the end result of win/lose with REINFORCE. Possibly also more human-sample-efficient?
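To make the tree-iteration idea concrete, here is a toy one-ply version of the refinement step (root-level PUCT selection plus policy rollouts; a real implementation would search a deeper tree). `policy` and `reward` are hypothetical stand-ins for the generator NN’s next-byte distribution and the trained reward model:

```python
import math, random

def mcts_refine(prefix, policy, reward, n_sims=100, max_len=64, c_puct=1.0):
    """Return visit-count-refined probabilities over the next ASCII byte."""
    N = [0] * 128                    # visit count per candidate next byte
    W = [0.0] * 128                  # total rollout reward per candidate
    priors = policy(prefix)          # generator NN's prior over the 128 bytes
    for _ in range(n_sims):
        total = sum(N)
        # PUCT selection: exploit high observed reward, explore high priors.
        def puct(a):
            q = W[a] / N[a] if N[a] else 0.0
            return q + c_puct * priors[a] * math.sqrt(total + 1) / (1 + N[a])
        a = max(range(128), key=puct)
        # Rollout: extend to full length by sampling the policy, then score.
        seq = list(prefix) + [a]
        while len(seq) < max_len:
            seq.append(random.choices(range(128), weights=policy(seq))[0])
        N[a] += 1
        W[a] += reward(seq)          # well-defined reward from the reward model
    return [n / sum(N) for n in N]   # refined targets to retrain the generator toward
```

The generator would then be retrained by cross-entropy toward these refined distributions, giving dense supervision at every timestep instead of a single terminal win/lose signal.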

Bootstrap. Ratings can be done pairwise on the various optimized sequences (random pairs of high-scoring sequences, although before/after comparisons might be more informative), and then the reward model trained.
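The reward-model training step is then just the standard pairwise loss of Christiano et al 2017, which models the probability that sequence A beats sequence B as the logistic of their score difference; a minimal sketch:

```python
import torch.nn.functional as F

def preference_loss(r_a, r_b, a_preferred):
    """Bradley-Terry pairwise loss: P(A preferred over B) = sigmoid(r(A) - r(B)).

    r_a, r_b: reward-model scores for each pair's two sequences;
    a_preferred: 1.0 where the rater chose A, 0.0 where they chose B.
    """
    return F.binary_cross_entropy_with_logits(r_a - r_b, a_preferred)
```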

Amortized inference. If gradient ascent is too slow for routine use, then one can distill the reward model into the generator by training GPT-2 on successively better corpora in the usual fast & efficient likelihood (imitation learning) way, for a sort of ‘expert iteration’: generate improved versions of a corpus by generating & selecting new datapoints scoring above a threshold (perhaps using a corpus of human datapoints as starting points), and train the generator on that.
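A sketch of that loop, with hypothetical helpers `sample` (cheap likelihood-based generation from the generator), `score` (the reward model), and `finetune` (ordinary likelihood training); the top-decile cutoff is an arbitrary illustrative choice:

```python
corpus = list(human_datapoints)                      # seed with human starting points
for _ in range(n_rounds):
    candidates = sample(generator, n=10_000)         # cheap ancestral sampling
    scores = score(reward_model, candidates)
    cutoff = sorted(scores)[int(0.9 * len(scores))]  # keep only the top decile
    corpus += [c for c, s in zip(candidates, scores) if s > cutoff]
    finetune(generator, corpus)                      # imitation learning on the better corpus
```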

Automatic editing via gradient ascent. Gradient ascent can be used to control and optimize text in various ways. For example, fiction could be edited in one region (say, to change a character’s name from “Miriam” to “Mary”) and then the edited region held fixed during gradient ascent while the rest of the text is free to vary; this would propagate the edit, because self-contradictory text is unlikely while self-consistent text is more likely. (It is a more likely text in which the protagonist is consistently named “Mary” throughout, rather than named “Mary” in one place and “Miriam” everywhere else.) One could produce multiple versions of a text by speculative edits—what if this character was a pirate? what if the protagonist lost this battle instead of winning? what if a banquet scene was deleted entirely?—and select the best one. One could also do this on heterogeneous sets of texts, such as collaboratively-edited works like the “SCP Foundation” wiki: there is no linear structure like a novel’s, but one could take an edited entry, concatenate it with an arbitrary second entry, do ascent on the second entry to make it more consistent, and save the modified version; iterated repeatedly over the entire set of entries, one would watch the wiki ‘grow’ organically—changes in one entry, like a new SCP or character, would automatically pop up elsewhere in logical ways, with all seams gradually smoothed over into one evolved whole.
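Mechanically, this is the same ascent loop as before plus a gradient mask: freeze the hand-edited span and let everything else move. A sketch, where `objective` would be the LM’s log-likelihood of the whole sequence (self-consistency) rather than a reward score, but the mechanics are identical:

```python
import torch

def ascend_masked(objective, x0, fixed_mask, steps=100, lr=0.05):
    """Ascend `objective` over embeddings x0, holding masked positions fixed."""
    x = x0.detach().clone().requires_grad_(True)
    keep = (~fixed_mask).unsqueeze(-1).float()  # 0 where the edit is frozen
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-objective(x)).backward()              # maximize consistency/likelihood
        x.grad.mul_(keep)                       # the edited span never changes
        opt.step()
    return x
```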

Parallel generation of counter-examples. If nothing else, I think this would help with the adversarial instances. Part of the problem is that each PPO run seems to collapse into a single specific adversarial instance. I can do a bunch of ratings penalizing all variants of that instance, which fixes it, but then I must wait another day or two until that run collapses into a new single adversarial instance. The reward model does seem to gradually get better and the adversarial instances to gradually increase in complexity, but the process is slow and serial. The gradient ascent approach will probably also find adversarial instances for the reward model, but at least it will do so in parallel: if I can run gradient ascent through the GPT-2-117M reward model on a minibatch (n = 11) of different random initial sequences simultaneously, they will probably find multiple adversarial instances in parallel, whereas PPO would find only the one it collapses on. So one gets many more useful adversarial instances to rate per run.
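In the ascent sketch above, this parallelism is just the batch dimension: start from a batch of random embedding sequences rather than a single encoded text (the sequence length here is arbitrary; 768 is GPT-2-117M’s embedding width):

```python
# n parallel random initializations, each free to climb to its own maximum
# (and hence, likely, its own adversarial instance) of the reward model:
n, seq_len, d = 11, 256, 768
x = torch.randn(n, seq_len, d, requires_grad=True)
# ...then run the same Adam ascent loop as before on the whole batch, using
# reward_model(inputs_embeds=x).sum() so each row receives its own gradient.
```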

Optimizing embeddings = nonsense? The most likely drawback to such gradient ascent on embeddings is the possibility that the maximized embedding will not convert back to any kind of sensible discrete symbol sequence, a failure mode which has caused trouble in attempts to do feature visualization via gradient ascent on BERT text models and on T5 models (requiring the addition of VAE-like constraints to make the latent space ‘smooth’, and—at least with extremely low-dimensional latent spaces like n = 2–10—tokens can be too separated to be reached by gradient-following).