Batch Normalization, and the zoo of related normalization strategies that have grown up around it, have played an interesting array of roles in recent deep learning research: as a wunderkind optimization trick, a focal point for discussions about theoretical rigor and, importantly, but somewhat more on the sidelines, as a flexible and broadly successful avenue for injecting conditioning information into models.

Conditional renormalization started humbly enough, as a clever trick for training more flexible style transfer models, but over the years this originally-simple trick has grown in complexity and conceptual scope. I kept seeing new variants of this strategy pop up, not just on the edges of the literature, but in its most central and novel advances: from the winner of 2017’s ImageNet competition to 2018’s most impressive generative image model. The more I saw it, the more I wanted to tell the story of this simple idea I’d watched grow and evolve from a one-off trick to a broadly applicable way of integrating new information in a low-complexity way.

While I’ll do some explaining of (relatively simple) mathematical mechanics along the way, my main goal is to explore the different uses of conditional normalization, chart out points where the idea evolved or expanded, and generally tell a coherent story about how it and its intellectual descendants have been playing an increasingly ascendant role in modern deep learning.

What is Conditioning?

In the most basic statistical sense, conditioning involves fixing some variable or set of variables to particular values, and observing the shifted distribution that comes from limiting the worlds you consider to ones where those conditions hold. If you’re considering the joint distribution over temperature and month, your belief distribution over “what temperature is it now”, taken over the whole year, will be meaningfully different from a distribution of temperatures conditioned on month=January. The extent to which the other features in a joint distribution change as a result of conditioning on some variables depends on how statistically related the two groups of features are. Continuing our prior example, we wouldn’t particularly expect our distribution over all temperatures to be different from the distribution over temperatures when the day of the week is Sunday, because weather doesn’t vary in ways that would make Sunday reliably hotter or colder than any other day. If your conditional distribution given a feature is different from the unconditional one, then knowing the value of that feature has given you information about the values of the remaining (presumed yet-unknown) features.

You can think of this through a predictive frame: if you were just going to pick a temperature by sampling from the year-long distribution, your error would be quite high in expectation. If you know you’re predicting for a day in January, you’ll likely be able to reduce your error by quite a lot. And, in fact, this simple conceptual tool of conditioning is exactly what’s used to represent the prediction problem within neural nets: learning a distribution (or point estimate) prediction of y, conditioned on the values of the known variables X. The hope, in prediction tasks, is that your X variables carry a lot of information about the values of y, so that building a model off of them can reduce your error relative to a random unconditional guess.
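To make the predictive frame concrete, here is a small sketch on made-up numbers. The monthly means and the noise level are invented for illustration and are not real climate data. It compares the squared error of always guessing the year-round mean against guessing the mean of the known month.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly mean temperatures (deg C), invented for illustration.
monthly_means = np.array([-2, 0, 5, 11, 17, 22, 25, 24, 19, 12, 6, 0], dtype=float)

# Simulate 1000 days per month: temperature = monthly mean + Gaussian noise.
month = np.repeat(np.arange(12), 1000)
temperature = monthly_means[month] + rng.normal(0.0, 3.0, size=month.size)

# Unconditional guess: always predict the year-round mean.
unconditional_mse = np.mean((temperature - temperature.mean()) ** 2)

# Conditional guess: predict the empirical mean of the known month.
per_month_mean = np.array([temperature[month == m].mean() for m in range(12)])
conditional_mse = np.mean((temperature - per_month_mean[month]) ** 2)

print(f"unconditional MSE: {unconditional_mse:.1f}")
print(f"conditional MSE:   {conditional_mse:.1f}")
```

The gap between the two errors is the information that knowing the month carries about temperature. Conditioning on a simulated day of the week instead would leave the error essentially unchanged.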

So, since all supervised models condition an output distribution on known input features, all models are in some sense conditional, but we tend to reserve the term for a specific subset of models where we use conditioning as a control lever to get our model to perform a desired behavior. Here are some recurring use cases of what we call conditioning that I’ll discuss in more detail later:

Answer questions about an image, conditioned on the specific question asked

Draw samples from a generative image model conditioned on a specific visual category (cat, dog, house, etc)

Generate images that mimic the visual style of some particular input image

A common feature of all these tasks is that they’re situations where we want a higher degree of control over the computation that our model performs at deployment time. We want the ability to ask (semi) arbitrary questions, and get the answer to whichever question we happen to ask, rather than some summarization of the image that’s hardcoded at training time. Rather than just sample from the space of all images, we want to tell the model that we want to sample a cat, or that we want to sample a painting in the style of Monet.

Pragmatically, this kind of control is achieved by designing a pathway to add a compressed form of some kind of information (say, a question) to our net, and merge it with the “main” input of the model (say, an image). Then we can train the model to learn an output distribution of answers conditioned on the fixed values of both image and question, which is hopefully more precise than if we either couldn’t see the image, or if we didn’t know what question was being asked. If I wanted to use an image generation model for some practical purpose (and GANs have got to be good for something practical), then that model would be more useful if I could give the model a request and have it give me a reasonable response. To put it another way, even if an image generator can perfectly capture the global distribution of images, typically we don’t actually *want* to sample from the global distribution, we want to sample something specific.

I also tend to think of conditioning as a simplified form of multi-task learning: we want to be able to perform tasks that in some contexts might be thought of as distinct (answer question A, answer question B) but for which we’re using a joint model with a few conditional parameters rather than fully separate models, because we think the tasks are similar enough to benefit from sharing parameters and extracted features.

There are some interesting conceptual insights that I think the world of conditioning has to offer, but they’re a bit difficult to explain in advance of examples, so I’d like to take you on a tour of some different conditioning approaches, and, along the way, highlight differences, evolutions in technique, and subtleties of problem formulation.

A Tour Through the Model Conditioning Zoo

Conditional GANs

One of the simplest forms of conditioning, both in terms of problem formulation and mechanism, is the conditional Generative Adversarial Network (GAN).

In an unconditional GAN, you sample a vector z from some distribution (typically a multivariate Gaussian), and train your network to map that vector into a pixel image that is indistinguishable from images in the true training set. A GAN’s goal is to learn parameters that map each possible sample vector to a plausible image output. In an unconditional GAN, you might learn to generate images of cats and dogs, but in order to know which vectors corresponded to either one, you’d need to pick samples and run those samples forward through the GAN, which would be a computationally costly enterprise.

A conditional GAN specifies as part of its input during training which class it’s trying to generate from, and essentially incentivizes your model to learn many per-class distributions, instead of one single distribution capturing all image categories at once. Instead of just feeding in your sampled z vector, you concatenate z with the one-hot representation corresponding to a particular category within your data distribution. Then, whenever you feed in the one-hot label corresponding to, say, cats, you train the network to generate images that appear to be drawn from the true distribution of cats; when you feed in the label for dog, you train it against the distribution of dogs, and so on.

This concatenation procedure is conceptually quite simple: the first layer of your network gains some parameters to handle the longer input, and the vector representation passed forward to higher layers has to somehow capture the fact of the label. But since that representation is no larger than it would be without the label, the later layers don’t require any additional parameters.
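A minimal sketch of the concatenation step is below. The layer sizes, the random weights standing in for learned ones, and the helper names `one_hot` and `first_layer` are all assumptions made for illustration; a real generator would have many more layers after this one.

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, num_classes, hidden_dim = 100, 10, 256  # hypothetical sizes

def one_hot(label, num_classes):
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

# First-layer weights cover the latent vector plus the one-hot label.
# Random values stand in for learned weights.
W1 = rng.normal(0.0, 0.02, size=(latent_dim + num_classes, hidden_dim))
b1 = np.zeros(hidden_dim)

def first_layer(z, label):
    # Concatenate noise and class label, then apply a dense layer with ReLU.
    conditioned_input = np.concatenate([z, one_hot(label, num_classes)])
    return np.maximum(conditioned_input @ W1 + b1, 0.0)

z = rng.normal(size=latent_dim)
h_cat = first_layer(z, label=3)   # same noise, two different labels
h_dog = first_layer(z, label=5)

# The label costs num_classes * hidden_dim extra weights in the first layer
# only; the hidden vector handed to later layers keeps the same size.
extra_params = num_classes * hidden_dim
```

With the same z, the two labels give different hidden vectors of identical shape, which is the only place the class information enters the network.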

A downside of the approach is that conditioning information is only fed in once, when z is fed into the network. The result of this is that each layer needs to learn to pass forward the information that will be needed by subsequent layers. As a simple example, imagine you’re layer 2 of a GAN. You see an input under the class “cat”, and realize that you should arrange the features available to you (things like “foot”, “leg”, “head”, etc.) in a way that corresponds to a cat body. But you also need to make sure to pass forward the fact of the image’s cat-ness, since that helps the next layer know that the foot should be a cat foot, the head should be a cat head, etc. This isn’t something the model has a built-in bias to do, and so it takes model capacity to learn a mechanism to ensure class information is passed forward. This weakness prompts the question of whether it’s possible to reintroduce conditional information at later points in the network, without adding too many additional parameters.

Instance Normalization for Style Transfer (2017)

A lot of interesting conditioning work has come out of the world of style transfer, where we want to condition the generation of one image on the style statistics of another image. The original way to accomplish this was to directly optimize the pixels of a single image so they matched the activation statistics of the reference image. But that was pretty inefficient, since we needed to perform optimization for each image, and so then we started to train networks that could take in a reference image, and transform it to a style-transformed version for some particular style with just a forward pass of the network. But even that required training a separate network for each style we wanted to do transfer with. This paper, on Adaptive Instance Normalization for Style Transfer, came in response to that problem, and built a conditional network that could be trained to transform images into multiple different styles, based on whatever style label was passed in.

They accomplished this through a technique they called Adaptive Instance Normalization, because it built on the existing mechanics of batch normalization with a small tweak that let it modulate the activations differently for different styles. Batch normalization has two parts:

Normalization (where we shift by the batch mean and scale by the batch standard deviation), and

Renormalization, where the network rescales and re-shifts the normalized feature values according to learned parameters.

Typically, the network would learn one gamma and beta per feature channel, and use that same (gamma, beta) pair for that channel at every spatial position: y = gamma * (x - mean) / sqrt(variance + epsilon) + beta.
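The two steps described above can be sketched in NumPy as follows. This is a training-mode sketch only: it uses the statistics of the current batch and leaves out the running averages a real implementation keeps for inference. The function name and the example shapes are my own choices.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, channels, height, width); gamma, beta: (channels,)."""
    # Step 1: normalize each channel using statistics over batch and space.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 2: re-scale and re-shift with one learned pair per channel,
    # broadcast over every example and spatial position.
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=(8, 3, 4, 4))
gamma = np.array([1.0, 2.0, 0.5])   # stand-ins for learned scales
beta = np.array([0.0, -1.0, 3.0])   # stand-ins for learned shifts
y = batch_norm(x, gamma, beta)
```

After the second step, each channel of y has mean close to its beta and standard deviation close to its gamma, whatever the statistics of x were.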

Conditional normalization modifies this basic approach by, instead of learning a single parameter pair per channel, learning a parameter pair per (channel, style_id) combination. So, if we’re trying to imitate Starry Night, we rescale each feature with one set of parameters, and if we’re trying to imitate the Mona Lisa, we’d use another.

A figure from the Adaptive Style Transfer paper showing the differences that can be achieved by using the renormalization parameters corresponding to each style

It’s worth taking a minute to dig a little more deeply into this, and be impressed at the fact that it’s able to work as well as it does. These renormalization parameters only impact the activation value of one feature at a time; there are no parameters governing the interaction effects between features as a function of what style we’re imitating. And they’re not spatially or contextually smart in any way: it’s a purely linear transformation, and we’ll rescale an activation of a given feature for a given style in the same way regardless of the value of that underlying activation at that position.
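As a sketch of the per-(channel, style_id) lookup described above: the only thing that changes between styles is which row of the parameter tables is used. The function name, the table layout, and the random values standing in for learned parameters are assumptions for illustration, not the paper's code.

```python
import numpy as np

def conditional_instance_norm(x, style_id, gammas, betas, eps=1e-5):
    """x: (channels, height, width); gammas, betas: (num_styles, channels)."""
    # Normalize each channel over the spatial positions of this one image.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Look up the (gamma, beta) row for the requested style.
    gamma = gammas[style_id][:, None, None]
    beta = betas[style_id][:, None, None]
    # The same scale and shift is applied at every spatial position.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
num_styles, channels = 2, 4   # e.g. 0 = Starry Night, 1 = Mona Lisa
gammas = rng.uniform(0.5, 1.5, size=(num_styles, channels))  # stand-ins
betas = rng.normal(size=(num_styles, channels))              # stand-ins

x = rng.normal(size=(channels, 8, 8))
out_style0 = conditional_instance_norm(x, 0, gammas, betas)
out_style1 = conditional_instance_norm(x, 1, gammas, betas)
```

Everything else in the network is shared across styles, so adding a style costs only two numbers per channel per normalized layer.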

However, despite these limitations, instance normalization proved powerful enough to deliver impressive results in this case, and variants of this simple kernel of an idea have gone on to be applied in a variety of settings; the rest of this post will be focused on charting its path.

But first, a quick note on terminology and intellectual lineage: a 2017 paper out of MILA proposed Feature-wise Linear Modulation (FiLM) as a catch-all term for conditioning mechanisms that operate by shifting and scaling each feature independently of the others, according to some learned or calculated shift and scale factor.

A figure from the FiLM paper, showing how channel-wise scale and shift modulation works in a convolutional network

I personally prefer this terminology, as I think it more cleanly captures the central insight, and puts less emphasis (correctly, I think) on the connection to batch normalization. However, later papers continued to use the “conditional normalization” framing, and built on this original Adaptive Instance Norm paper. As a result, in the rest of this post, I may alternate between “renormalization” and “modulation” as words that mean the same thing: shifting and scaling each feature in each layer according to some set of parameters. The interesting question, and the one along which this method becomes interestingly general, is: where do these parameters come from?

GauGAN: Masked Scene Conditioning

An interesting variation of the above technique is the one used by the painting generator GauGAN. This model takes as input a mask of scene categories, with different parts of the image attached to different scene types. It then learns a different set of renormalization parameters for each scene type, so that for each (scene type, layer, feature) combination, there’s a unique shift and scale value.

A diagram showing how GauGAN works: the user passes in 1) a map corresponding to where in the image they want different kinds of landscape features, and 2) a style image, and the network applies, for example, the conditional parameters corresponding to “tree” in places where that mask is present.

For each part of the image, the model applies the renormalization parameters that correspond to the scene type specified in the mask for that location. So, it’s as if we’re using a class-conditional model, like the ones mentioned earlier, except that we’re generating different classes in different locations of the image. This is a straightforward generalization of the Style Transfer case: where that model learns one parameter set for each of a few discrete categories, and applies one set at a time across the whole image, GauGAN applies its landscape-specific parameters based on what the user is trying to generate in a given region of the image.
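Here is a sketch of the per-location lookup as described above. It follows the simplified picture in this post (one learned pair per scene type, selected by the mask) rather than the released GauGAN code, and the function name, shapes, and random stand-in parameters are assumptions.

```python
import numpy as np

def mask_conditioned_modulation(x_hat, mask, gammas, betas):
    """x_hat: normalized features (channels, H, W); mask: int labels (H, W);
    gammas, betas: (num_scene_types, channels)."""
    # Indexing with the mask gives (H, W, channels) maps; move channels first.
    gamma_map = np.moveaxis(gammas[mask], -1, 0)
    beta_map = np.moveaxis(betas[mask], -1, 0)
    # Each pixel is scaled and shifted by its own scene type's parameters.
    return gamma_map * x_hat + beta_map

rng = np.random.default_rng(0)
num_scene_types, channels = 3, 2      # e.g. 0 = sky, 1 = tree, 2 = water
gammas = rng.uniform(0.5, 1.5, size=(num_scene_types, channels))  # stand-ins
betas = rng.normal(size=(num_scene_types, channels))              # stand-ins

mask = np.zeros((4, 4), dtype=int)    # top half sky ...
mask[2:, :] = 1                       # ... bottom half tree
x_hat = np.ones((channels, 4, 4))     # stand-in for normalized activations
out = mask_conditioned_modulation(x_hat, mask, gammas, betas)
```

The only difference from the per-style case is that the lookup index is now a map over positions instead of a single integer for the whole image.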

Image-Based Question Answering

Both of the techniques above condition on discrete, categorical information: one style out of a fixed and finite set of styles, one landscape category out of a fixed set of categories. In these cases, it’s possible to just learn a unique parameter set that corresponds to each category, and pull out the parameter set that applies to the category you’re trying to condition on.

But in some cases, instead of a one-hot or categorical variable, you might want to condition on a continuous vector that can vary smoothly between different representations, like an image, or a Word2Vec vector, or a sequence of words: something more complex than a simple one-of-K categorical value. In this situation, we can’t simply learn different per-category parameters, because there are infinitely many points in the continuous space; we instead need a mapping function that takes in whatever complex data object we want to condition on, and passes information forward in a way that the conditioning function can use. (Where “mapping function” generally means “a neural net that processes the input in some way and spits out a vector”)

Building on this intuition, in the example I’m referencing here, the problem is to take in an image and a question about the content of that image, and to produce the answer to the question asked. Here, the image is treated as the primary input, and the question as the conditioning input. The authors used an extension of the same conditional renormalization trick, but instead of having one set of scale and shift parameters for each style, they train a network to output values of these parameters for a given sentence. Put more concretely, given some question, an LSTM learns a mapping such that the scale and shift induced by that question in a convolutional network processing the image condition that network in a way that makes its output softmax more likely to produce the answer to that particular question. Now, instead of capturing information about conditioning in the renormalization parameters themselves, we capture it in the network that produces those parameters, given some input.
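A rough sketch of generated (rather than looked-up) modulation parameters follows. A single linear map stands in for the learned generator, and random vectors stand in for the LSTM's encoding of two questions; the names `film_parameters` and `film` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
question_dim, channels = 16, 4        # hypothetical sizes

# Stand-in for the learned generator: one linear map to 2 * channels values.
W = rng.normal(0.0, 0.1, size=(question_dim, 2 * channels))
b = np.zeros(2 * channels)

def film_parameters(question_embedding):
    # Split the generator output into per-channel scale and shift.
    gamma, beta = np.split(question_embedding @ W + b, 2)
    return gamma, beta

def film(features, gamma, beta):
    """features: (channels, H, W); gamma, beta: (channels,)."""
    return gamma[:, None, None] * features + beta[:, None, None]

features = rng.normal(size=(channels, 5, 5))  # image features from a CNN layer
q1 = rng.normal(size=question_dim)            # stand-in for an encoded question
q2 = rng.normal(size=question_dim)            # a different question

out_q1 = film(features, *film_parameters(q1))
out_q2 = film(features, *film_parameters(q2))
```

Because the parameters are a function of the question vector, any new question yields some scale and shift, with no per-question table to extend.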

One Shot Videos of Facial Expressions

In addition to the ability to process complex data, another benefit of using a network to map generic inputs into conditioning weights is that you can condition on inputs you never saw during training, because the network will have (hopefully) learned a mapping that’s generalizable. A recent paper makes use of this to train a network that generates frames of faces making desired arbitrary expressions, even if it’s only seen one example of the face.

This model needs to combine two pieces of information in order to generate its artificial image:

1) The facial landmark data of the expression we want our subject to be making

2) At least one reference photo of the subject, with landmark data drawn on top of the image

A diagram showing the data sources for one-shot expression transfer. The expression face is turned into landmarks, and a network is run on those landmarks, using conditional feature modulation parameters generated from the source image, to generate the result.

The authors chose a generation structure that takes in the desired landmark as input, and gradually does inverse convolutions to scale that up to an image of an entire face making an expression. In order to make this not just an average face, but our target face making that expression, they perform conditional renormalization throughout the generation network, with renormalization parameters generated by a CNN that takes in the target face as input. When we see a new face at test time, we pass it through this CNN, and hope that it produces parameters that do a good job of modulating feature values to guide production of that face. Zooming back out from the perspective of conditioning, we can think of this as a kind of meta-learning, where the network is learning how to generate the parameters of its own computation based on the input it sees.

Speaker-Conditioned Speech Recognition

Starting out from an initial kernel of one set of renormalization parameters, conditional normalization methods have expanded the boundaries of the idea of vanilla batch norm first to multiple parameter sets, and then to dynamically generated parameters. Conditioning methods have also recently become stranger and more creative in ways other than the mechanism of generating parameters; one example of this is the technique of self-conditioning. At a very high level, self-conditioning is a way to condition local computation on a summarized version of global information.

In the first problem where we’ll look at self-conditioning being applied, the objective is to perform speech recognition, taking in an audio waveform and predicting the sequence of words that are being spoken. When the paper in question was being written, recurrent neural nets, or RNNs, were the standard mechanism for sequence to sequence problems like this. (It’s a year or so old; it seems likely state of the art these days uses a transformer, though I’ve personally not gone down that rabbit hole to check). RNNs iteratively accumulate state as they progress over a sequence, and use that state to inform their calculations at that point in the sequence. This allows the network to incorporate past context, but in practice RNNs tend to focus on the recent past, and have difficulty keeping track of old information.

In this context of mostly local computation, we can see how it would be useful to have compact global information available. As an example, if you only heard a small snippet of speech, and didn’t know the speaker’s accent, you might have a hard time understanding how the sounds they use map onto words, whereas if you zoomed out to hear more, you could use your knowledge of how words sound in, say, an English accent, to correctly parse them from a waveform. This is more or less exactly what this paper does: it calculates a summary vector by aggregating representations across the whole waveform (at which point it has lost word-specific information, but still retains speaker-specific information), and uses that vector as input to a network that generates the conditional renormalization parameters used in the RNN operations.
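The idea can be sketched as below, under some loud simplifications: the modulation is applied directly to per-frame features rather than inside an RNN, the summary is a plain average over time, and every weight matrix is a random stand-in for something learned. None of the names or sizes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, feat_dim, summary_dim = 50, 8, 6   # hypothetical sizes

# Stand-ins for learned weights.
W_summary = rng.normal(0.0, 0.3, size=(feat_dim, summary_dim))
W_gamma = rng.normal(0.0, 0.3, size=(summary_dim, feat_dim))
W_beta = rng.normal(0.0, 0.3, size=(summary_dim, feat_dim))

def utterance_summary(frames):
    # Average a per-frame projection over time: word order is lost,
    # utterance-wide properties (such as the speaker) can survive.
    return np.tanh(frames @ W_summary).mean(axis=0)

def self_conditioned(frames, eps=1e-5):
    summary = utterance_summary(frames)
    gamma = 1.0 + summary @ W_gamma   # scale centered on 1
    beta = summary @ W_beta
    # Normalize each frame across its features, then apply the same
    # utterance-level scale and shift at every time step.
    mean = frames.mean(axis=1, keepdims=True)
    std = np.sqrt(frames.var(axis=1, keepdims=True) + eps)
    return gamma * (frames - mean) / std + beta

frames = rng.normal(size=(num_frames, feat_dim))  # stand-in acoustic features
modulated = self_conditioned(frames)
```

Shuffling the frames in time leaves the summary unchanged, which is one way to see that it carries utterance-level rather than word-level information.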

This is getting to the point where our intuitions about conditioning in a simple statistical sense start to break down: we’re not conditioning on a category, or even any piece of external data, but rather on information contained in the input data point itself, albeit aggregated at a different scale. This makes an important and interesting point that will recur over the last few examples: even if information is theoretically accessible without it, feature modulation can be a good way to repeatedly make a usefully-compressed version of that information available throughout computation.

Squeeze and Excitation Networks

This strategy, of using summarized global information to modulate local feature computation, has also been found to be effective in the canonical problem of the modern machine learning age: large-scale image classification. In 2017, the ImageNet challenge was won by the “Squeeze and Excitation” architecture. S&E’s defining feature is the way it uses feature modulation to communicate information about broad, global behavior of channel activations, which wouldn’t otherwise be visible within a given convolutional layer’s limited receptive field.

The mechanics will sound familiar, given the examples we’ve seen so far: the network computes a per-channel global summary by averaging a given channel’s activation over all spatial regions of the image. Then, this vector is passed into a network to generate a vector of multiplicative factors — one per channel — which control the scaling of a given feature at every spatial location. This is a bit different from the renormalization approach which learns both a scale and shift factor, but similar in essential detail, since both perform conditioning by feature-wise modulation.
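The mechanics just described can be sketched as a single function. Biases are omitted for brevity, the weights are random stand-ins for learned ones, and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def squeeze_excite(x, W1, W2):
    """x: (channels, H, W); W1: (channels, hidden); W2: (hidden, channels)."""
    # Squeeze: global average pool gives one number per channel.
    squeezed = x.mean(axis=(1, 2))
    # Excitation: small two-layer network ending in a sigmoid gate per channel.
    gates = sigmoid(np.maximum(squeezed @ W1, 0.0) @ W2)
    # Rescale every spatial position of each channel by that channel's gate.
    return gates[:, None, None] * x, gates

rng = np.random.default_rng(0)
channels, hidden = 8, 2               # hypothetical sizes (reduction ratio 4)
W1 = rng.normal(0.0, 0.5, size=(channels, hidden))
W2 = rng.normal(0.0, 0.5, size=(hidden, channels))
x = rng.normal(size=(channels, 6, 6))
y, gates = squeeze_excite(x, W1, W2)
```

Each gate lies between 0 and 1 and depends on the averages of all channels, so a purely local convolution downstream sees features that have already been reweighted by image-wide context.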

A diagram for Squeeze and Excitation networks, showing 1) transforming the spatial features, 2) aggregating them by feature/channel, and 3) using them to rescale the values of those channels over the whole image

A central problem of convolutional networks is: how do we balance our desire for a wide receptive field with the high parameter cost we’d have to pay to widen that field by using larger convolutional kernels? This approach sidesteps these issues by creating a conditioning-based pathway for global information flow into what would otherwise be purely local computations.

StyleGAN

So far, we’ve seen models that condition on a class, on an external data object, or on a global-scale representation of the same input data instance. While different from one another, these approaches all have the shared property that the information we condition the rest of the network on is some concrete, existing, specify-able thing: a label, an image, etc. If you asked us what we were conditioning on for a given instance, we’d know ahead of time, and could describe it in words. We give our model some kind of prompt (the class “horse”, the style “van Gogh”, a particular image to answer questions about) and optimize it to behave in a way that corresponds to that prompt.

GANs take that conditioning paradigm and turn it upside down. Instead of taking a class or style vector and learning how to map it to a specific desired output, GANs work by taking in a random noise vector, which has no a priori connection to a particular output, and mapping it to a sample from the data distribution. GANs define a correspondence between pixel-based images, and the “code” vectors that represent those images. In a simple, mechanical sense, when you pass in a vector, that corresponds to a single, specific image, since a GAN is just a bunch of deterministic transformations.

So, once the GAN is trained, each vector corresponds to one particular image, and if we “conditioned” the GAN by passing in that vector, it would generate that image. While it’s clearly mathematically true that this mapping exists, it’s not particularly useful for the typical use cases of conditioning, because we couldn’t have known ahead of time which vectors would correspond to which types of images, and, even with a trained model, we don’t have a way of knowing what vector we would need to pass in to produce whatever output we might have in mind. The only way of finding out would be to run the model forward and sample. If we’re lucky, nearby vectors might produce nearby images, but that’s not mathematically required.

This is a core strangeness of GANs: because the model ultimately builds a mapping between vector and output, we can say that the noise vector carries the information that specifies what image is produced, even though it didn’t contain any intentional encoding of that information to start with.

StyleGAN takes this idea — that a given noise vector carries information specifying the particular image it produces — and uses it to create a powerfully effective form of self-conditioning. Each time the StyleGAN runs, it samples a z vector, and then feeds that vector into a network that — you guessed it — produces feature-wise modulation weights for each layer and channel in the image generation hierarchy. This can be seen as a way of reintroducing and reinforcing the global image specification at various points in generation.

StyleGAN samples showing the images you can generate by combining conditioning parameters corresponding to multiple different images

Incidentally, the reason the network is called StyleGAN is that it took its inspiration for conditional renormalization from one of the first papers we talked about, the one on Adaptive Style Transfer. With a StyleGAN model, you can also do interesting forms of transfer, by using the normalization parameters corresponding to one image at the low levels of the network, and ones corresponding to a different image at higher levels. This has the effect of transferring high-level, stylistic features of one face to the generation of another.
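A sketch of per-layer modulation derived from a single z, and of the mixing trick described above, is below. It only computes the modulation parameters, not an image, and it collapses whatever network sits between z and the parameters into one linear map per layer. The weights are random stand-ins and the names `layer_styles` and `mix_styles` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, channels, num_layers = 16, 4, 6   # hypothetical sizes

# Stand-ins for learned weights: one linear "style" map per generator layer.
style_weights = rng.normal(0.0, 0.2, size=(num_layers, latent_dim, 2 * channels))

def layer_styles(z):
    # One (gamma, beta) pair per layer and channel, all derived from the same z.
    params = np.einsum("d,ldc->lc", z, style_weights)
    gammas, betas = params[:, :channels] + 1.0, params[:, channels:]
    return gammas, betas

def mix_styles(z_a, z_b, crossover):
    # Layers before `crossover` take their modulation from z_a, the rest from z_b.
    gammas_a, betas_a = layer_styles(z_a)
    gammas_b, betas_b = layer_styles(z_b)
    gammas = np.concatenate([gammas_a[:crossover], gammas_b[crossover:]])
    betas = np.concatenate([betas_a[:crossover], betas_b[crossover:]])
    return gammas, betas

z_a, z_b = rng.normal(size=latent_dim), rng.normal(size=latent_dim)
gammas, betas = mix_styles(z_a, z_b, crossover=3)
```

Since every layer receives its own copy of the specification, swapping the source of that copy partway through the stack is a one-line change, which is what makes this kind of mixing cheap to try.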

If we believe the authors that this is the mechanism at work, it would make sense that continually passing in information about a shared global representation makes it easier for generation layers to behave in a consistent way and coordinate across spatial locations on what image is being created, since the network doesn’t have to worry about passing the specification information forward. This is especially true compared to the world of prior work where the global specification was only fed in once at the input layer of the model, and each local generation only had access to local information when it performed its computation.

What Can We Learn From This?

If you’ve made it this far, I am both impressed and appreciative. I realize this post is different stylistically from others I’ve done: more of a map laying out the conceptual connections between techniques than a detailed explanation of any of them. Also, I realize that it’s enormous. So, given that the mechanics of many of these approaches are relatively straightforward, why was this worth me writing or you reading?

On an object level, I think feature-wise conditional normalization is interesting because it’s impressive and surprising to me that it works as well as it does. It only has the power to modify the values of individual features at a time, it (typically) modifies each feature in the same way at all spatial points, and it modifies each feature the same amount regardless of that feature’s initial value or the initial values of any of the other features at that point. Compared to all the flexible ways you could imagine working on conditioning information, it’s a pretty blunt instrument. The fact that it’s possible for the network to produce such coherently different outputs as a result of feature modulation updates my intuitions about deep networks in a way I suspect is useful.

On a more meta level, the aspect of conditional normalization that I find most interesting is that it didn’t start from deep theoretical roots, but instead came about as more or less an engineering heuristic, a “one weird trick with Batch Norm” for style transfer in the Adaptive Instance Normalization paper, that ended up growing and becoming more complex and varied in its applications. Machine Learning has a dual existence as both a realm of theoretical hypothesis and a practical engineering discipline, and as I watched this practically-motivated technique evolve over time, it felt like watching a conversation between those two sides. This is most salient in the ways that innovations in conditional normalization are often not the center stage contribution of a paper, but just a minor feature of architecture or implementation.

Finally, and somewhat off the path from conditional normalization strictly speaking, I found the idea of self-conditioning, the technique of conditioning local behavior on a summarized global context, a valuable and compelling one to have in my mental toolkit. I suppose it’s actually a bit conceptually similar to the old encoder-decoder model for translation, since in that domain it was obvious that you couldn’t do, for example, local unspooling of a translated sentence without knowing the full context of what you were translating. Approaches like StyleGAN and Squeeze and Excitation networks just take that idea and apply it more broadly.

As a caveat to all of that positivity, in the course of writing this post I’ve become frustrated by the lack of good review papers comparing how well different conditioning mechanisms work, and what their limitations are. In the papers I mention, the authors’ primary goal is to demonstrate that the kind of conditional renormalization they used worked, but it’s left unclear whether some alternative would have worked better. Can conditional normalization perform well in the many-class conditioning regime of conditional GANs? Do we really need two renormalization parameters to do effective conditioning, or can we get away with one? Though we have many examples of the method working, basic questions like this are still unanswered. So, although it has ranged far afield from its Batch Norm origins, perhaps proponents of conditional renormalization have a lesson to learn from its intellectual ancestor: in the absence of rigorous examination of why and when something works, our understanding is left less solid than we’d like it to be.

As always, I still have questions!

How integral is these methods’ connection to renormalization? Would they work just as well if you just added a feature modulation layer with the same structure, in the absence of a prior Batch Norm operation? My suspicion is that Batch Norm is so generally useful that removing it hasn’t made sense, but given that this whole school of conditioning has its name rooted in its origin in normalization, it would be valuable to know whether that connection is an essential or incidental one.

Extending the above question, are there applications that feature modulation just wouldn’t work for? Would it be too simplistic as a way to do fully class-conditional GANs, for example? (Most of the conditional GAN examples I’ve found still take the concatenate-class-ID-to-the-input-vector approach.)

References