Beyond SPADE: Introducing Semantic Region-Adaptive Normalization for Image Synthesis

Synced · Feb 11

Image synthesis is a loaded topic in AI. Despite deepfake scandals in which the technology was maliciously co-opted to generate pornographic and other misleading video clips, advanced image synthesis has emerged as a vibrant research field with a wide range of potential applications for the many enterprises that use computer vision. The challenge of making a “fake” image look more “real” is attracting a growing number of machine learning researchers worldwide.

The most popular method for synthesizing photorealistic images given an input semantic layout is spatially-adaptive normalization (SPADE, also known as GauGAN). SPADE, however, is limited to a single style per generated image, which is a problem when, for example, different output styles are desired for different compositional elements of the image. Recent studies also indicate that inserting style information at multiple layers of a network leads to higher-quality images. The SPADE architecture, which injects its style information only at the beginning of the network, could therefore likely be improved.

To address these shortcomings and boost performance, Peihao Zhu and colleagues from King Abdullah University of Science and Technology (KAUST) in Saudi Arabia and Cardiff University in the UK recently introduced semantic region-adaptive normalization (SEAN), a simple but effective building block for conditional generative adversarial networks (cGANs).

SEAN is conditioned on segmentation masks that describe the semantic regions in the desired output image. Using SEAN normalization, a network architecture can be made to control the style of each semantic region individually.

The SEAN generator network is built on top of SPADE. Its residual blocks each contain three convolutional layers whose scales and biases are modulated separately by individual SEAN blocks. Each SEAN block takes two inputs: a set of style codes, one per region, and a semantic mask that defines the region to which each style code applies.
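The per-region modulation can be illustrated with a heavily simplified sketch. It assumes a single-channel feature map flattened into a list, and it assumes each region already has a ready-made (scale, bias) pair; in the actual model these parameters are learned from the style codes and the mask by convolutional layers. The function name `region_adaptive_norm` and the region labels are hypothetical.

```python
import math

def region_adaptive_norm(features, mask, region_params, eps=1e-5):
    """Normalize a flat feature map, then modulate each pixel
    with the (scale, bias) pair of the region it belongs to."""
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    inv_std = 1.0 / math.sqrt(var + eps)

    out = []
    for x, region in zip(features, mask):
        scale, bias = region_params[region]  # style chosen per region, not per image
        out.append(scale * (x - mean) * inv_std + bias)
    return out

# Toy example: four pixels, two semantic regions with different styles.
features = [1.0, 2.0, 3.0, 4.0]
mask = ["skin", "skin", "hair", "hair"]
region_params = {"skin": (1.0, 0.0), "hair": (2.0, 0.5)}  # assumed values
modulated = region_adaptive_norm(features, mask, region_params)
```

Because the scale and bias are looked up through the mask, changing the entry for one region restyles only the pixels of that region, which is the property that a single image-wide style cannot provide.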

Training is formulated as an image reconstruction problem. Each image region is first delineated by the segmentation mask and distilled into a style code by the style encoder. The generator network then reconstructs the entire image by “adding up” the individual image regions. The loss function has three main terms: a conditional adversarial loss, a feature matching loss and a perceptual loss.
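A minimal sketch of how three such terms might be combined into one generator objective is given below. It assumes, as is common in this family of models, that the feature matching term compares discriminator features of real and generated images and that the perceptual term compares features from a pretrained network. The function names and the weights `w_fm` and `w_perc` are hypothetical and are not values taken from the paper.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def feature_distance(real_feats, fake_feats):
    """Average the L1 distance over a list of per-layer feature vectors."""
    per_layer = [l1_distance(r, f) for r, f in zip(real_feats, fake_feats)]
    return sum(per_layer) / len(per_layer)

def generator_loss(adv_loss, disc_real, disc_fake, net_real, net_fake,
                   w_fm=1.0, w_perc=1.0):
    """Weighted sum of adversarial, feature matching and perceptual terms."""
    fm = feature_distance(disc_real, disc_fake)    # discriminator features (assumed)
    perc = feature_distance(net_real, net_fake)    # pretrained-network features (assumed)
    return adv_loss + w_fm * fm + w_perc * perc

# Toy example with made-up numbers.
total = generator_loss(
    adv_loss=0.5,
    disc_real=[[1.0, 2.0]], disc_fake=[[1.5, 2.5]],
    net_real=[[0.0, 0.0], [1.0, 1.0]], net_fake=[[0.2, 0.2], [1.0, 1.0]],
)
```

In the toy example the feature matching term is 0.5 and the perceptual term is 0.1, so with unit weights the total comes to about 1.1.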

Visual comparison of semantic image synthesis results

Quantitative comparison of reconstruction quality

Quantitative comparison using semantic segmentation

The researchers compared SEAN with the leading semantic image synthesis models Pix2PixHD and SPADE on the CelebAMask-HQ, CityScapes, ADE20K and Facades datasets. Semantic segmentation performance was measured by mIoU and accuracy, and generation performance by FID. The SEAN generator showed higher visual quality, reconstruction quality and variability than the other methods on all the datasets.
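For readers unfamiliar with the segmentation metrics, the sketch below shows one standard way to compute mIoU and accuracy from per-pixel class labels. It assumes that accuracy means per-pixel accuracy and that classes absent from both prediction and ground truth are skipped when averaging; the exact evaluation protocol used by the researchers may differ. The label lists are made-up toy data.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across the classes that actually occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
acc = pixel_accuracy(pred, target)
miou = mean_iou(pred, target, num_classes=3)
```

Here four of the six pixels match, and the per-class IoU values are 1/3, 2/3 and 1/2, giving a mean of 0.5.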

The paper SEAN: Image Synthesis with Semantic Region-Adaptive Normalization is on arXiv.