A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/2 neural networks, and tips/scripts for effective StyleGAN use.

When Ian Goodfellow’s first GAN paper came out in 2014, with its blurry 64px grayscale faces, I said to myself, “given the rate at which GPUs & NN architectures improve, in a few years, we’ll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious.” There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/illustrations/drawings are so different from the exclusively-photographic datasets always (over)used in contemporary ML research that I was curious how it would work on anime—better, worse, faster, or different failure modes? Even more amusing—if random images become doable, then text→images would not be far behind.

Hand-selected sample from the Asuka Souryuu Langley-finetuned StyleGAN

So when GANs hit 128px color images on ImageNet, and could do somewhat passable CelebA face samples around 2015, along with my char-RNN experiments, I began experimenting with Soumith Chintala’s implementation of DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of Asuka Souryuu Langley from Neon Genesis Evangelion because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)

It did not work. Despite many runs on my laptop & a borrowed desktop, DCGAN never got remotely near to the level of the CelebA face samples, typically topping out at reddish blobs before diverging or outright crashing. Thinking perhaps the problem was too-small datasets & I needed to train on all the faces, I began creating the Danbooru2017 version of “Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”. Armed with a large dataset, I subsequently began working through particularly promising members of the GAN zoo, emphasizing SOTA & open implementations.

Among others, I have tried StackGAN/StackGAN++ & Pixel*NN* (failed to get running), WGAN-GP, Glow, GAN-QP, MSG-GAN, SAGAN, VGAN, PokeGAN, BigGAN, ProGAN, & StyleGAN. These architectures vary widely in their design & core algorithms and which of the many stabilization tricks (Wiatrak & Albrecht 2019) they use, but they were more similar in their results: dismal.

Glow & BigGAN had promising results reported on CelebA & ImageNet respectively, but unfortunately their training requirements were out of the question. (As interesting as SPIRAL and CAN are, no source was released and I couldn’t even attempt them.)

While some remarkable tools like PaintsTransfer/style2paints were created, and there were the occasional semi-successful anime face GANs like IllustrationGAN, the most notable attempt at anime face generation was MakeGirls.moe (Jin et al 2017). MGM could, interestingly, do in-browser 256px anime face generation using tiny GANs, but that is a dead end. MGM accomplished that much by making the problem easier: they added some light supervision in the form of a crude tag embedding, and then simplified the problem drastically to n=42k faces cropped from professional video game character artwork, which I regarded as not an acceptable solution—the faces were small & boring, and it was unclear if this data-cleaning approach could scale to anime faces in general, much less anime images in general. They are recognizably anime faces but the resolution is low and the quality is not great:

2017 SOTA: 16 random MakeGirls.moe face samples (4×4 grid)

Typically, a GAN would diverge after a day or two of training, or it would collapse to producing a limited range of faces (or a single face), or if it was stable, simply converge to a low level of quality with a lot of fuzziness; perhaps the most typical failure mode was heterochromia (which is common in anime but not that common)—mismatched eye colors (each color individually plausible), from the Generator apparently being unable to coordinate with itself to pick consistently. With more recent architectures like VGAN or SAGAN, which carefully weaken the Discriminator or which add extremely-powerful components like self-attention layers, I could reach fuzzy 128px faces.

Given the miserable failure of all the prior NNs I had tried, I had begun to seriously wonder if there was something about non-photographs which made them intrinsically unable to be easily modeled by convolutional neural networks (the common ingredient to them all). Did convolutions make it impossible to generate sharp lines or flat regions of color? Did regular GANs work only because photographs were made almost entirely of blurry textures?

But BigGAN demonstrated that a large cutting-edge GAN architecture could scale, given enough training, to all of ImageNet at even 512px. And ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat infeasible amounts of training. ProGAN (source; video), while expensive and requiring >6 GPU-weeks, did work and was even powerful enough to overfit single-character face datasets; I didn’t have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting. Because, a common sequence in DL/DRL (unlike many areas of AI) is that a problem seems intractable for long periods, until someone modifies a scalable architecture slightly, produces somewhat-credible (not necessarily human or even near-human) results, and then throws a ton of compute/data at it and, since the architecture scales, it rapidly exceeds SOTA and approaches human levels (and potentially exceeds human-level). Now I just needed a faster GAN architecture which I could train a much bigger model with on a much bigger dataset.

A history of GAN generation of anime faces: ‘do want’ to ‘oh no’ to ‘awesome’

StyleGAN was the final breakthrough in providing ProGAN-level capabilities but fast: by switching to a radically different architecture, it minimized the need for the slow progressive growing (perhaps eliminating it entirely), and learned efficiently at multiple levels of resolution, with the bonus of providing much more control of the generated images with its “style transfer” metaphor.

Examples

First, some demonstrations of what is possible with StyleGAN on anime faces:

64 of the best TWDNE anime face samples selected from social media (click to zoom)

100 random sample images from the StyleGAN anime faces on TWDNE

Even a quick look at the MGM & StyleGAN samples demonstrates the latter to be superior in resolution, fine details, and overall appearance (although the MGM faces admittedly have fewer global mistakes). It is also superior to my 2018 ProGAN faces. Perhaps the most striking fact about these faces, which should be emphasized for those fortunate enough not to have spent as much time looking at awful GAN samples as I have, is not that the individual faces are good, but rather that the faces are so diverse, particularly when I look through face samples with 𝜓≥1—it is not just the hair/eye color or head orientation or fine details that differ, but the overall style ranges from CG to cartoon sketch, and even the ‘media’ differ: I could swear many of these are trying to imitate watercolors, charcoal sketching, or oil painting rather than digital drawings, and some come off as recognizably ’90s-anime-style vs ’00s-anime-style. (I could look through samples all day despite the global errors because so many are interesting, which is not something I could say of the MGM model, whose novelty is quickly exhausted; it appears that users of my TWDNE website feel similarly, as the average length of each visit is 1m:55s.)

Interpolation video of the 2019-02-11 face StyleGAN demonstrating generalization. StyleGAN anime face interpolation videos are Elon Musk™-approved!

Later interpolation video (2019-03-08 face StyleGAN)

Training requirements

Data

“The road of excess leads to the palace of wisdom

…If the fool would persist in his folly he would become wise

…You never know what is enough unless you know what is more than enough. …If others had not been foolish, we should be so.”

William Blake, “Proverbs of Hell”, The Marriage of Heaven and Hell

The necessary size for a dataset depends on the complexity of the domain and whether transfer learning is being used. StyleGAN’s default settings yield a 1024px Generator with 26.2M parameters, which is a large model and can soak up potentially millions of images, so there is no such thing as too much. For learning decent-quality anime faces from scratch, a minimum of 5,000 appears to be necessary in practice; for learning a specific character when using the anime face StyleGAN, potentially as little as ~500 (especially with data augmentation) can give good results. For domains as complicated as “any cat photo”, like Karras et al 2018’s cat StyleGAN which is trained on the LSUN CATS category of ~1.8M cat photos, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that “CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds.”

Compute

To fit reasonable minibatch sizes, one will want GPUs with >11GB VRAM. At 512px, that will only train n=4, and going below that means it’ll be even slower (and you may have to reduce learning rates to avoid unstable training). So, Nvidia 1080ti & up would be good. (Reportedly, AMD/OpenCL works for running StyleGAN models, and there is one report of successful training with “Radeon VII with tensorflow-rocm 1.13.2 and rocm 2.3.14”.)

The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & provide a worst-case AWS-based cost estimate):

Estimated StyleGAN wallclock training times for various resolutions & GPU-clusters (source: StyleGAN repo)

| GPUs | 1024² | 512² | 256² | [March 2019 AWS costs] |
|------|-------|------|------|------------------------|
| 1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115] |
| 2 | 21 days 22 hours [1,052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA] |
| 4 | 11 days 8 hours [1,088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA] |
| 8 | 6 days 14 hours [1,264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382] |

AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound (compare Vast.ai); 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary—with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1,000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/StyleGAN particularly struggle with backgrounds & edges of images because those are cut off, obscured, and highly-varied compared to the faces, whether anime or FFHQ. I hypothesize that the telltale blurry backgrounds are due to the impoverishment of the backgrounds/edges in cropped face photos, and they could be fixed by transfer-learning or pretraining on a more generic dataset like ImageNet, so it learns what the backgrounds even are in the first place; then in face training, it merely has to remember them & defocus a bit to generate correct blurry backgrounds.)

Training improvements: 256px StyleGAN anime faces after ~46 GPU-hours vs 512px anime faces after 382 GPU-hours; see also the video montage of the first 9k iterations

Data Preparation

The most difficult part of running StyleGAN is preparing the dataset properly. StyleGAN does not, unlike most GAN implementations (particularly PyTorch ones), support reading a directory of files as input; it can only read its unique .tfrecord format, which stores each image as raw arrays at every relevant resolution. Thus, input files must be perfectly uniform, (slowly) converted to the .tfrecord format by the special dataset_tool.py tool, and will take up ~19× more disk space.

A StyleGAN dataset must consist of images all formatted exactly the same way: images must be precisely 512×512px or 1024×1024px etc (any eg 512×513px images will kill the entire run), they must all be the same colorspace (you cannot have sRGB and grayscale JPGs—and I doubt other color spaces work at all), the filetype must be the same as the model you intend to (re)train (ie you cannot retrain a PNG-trained model on a JPG dataset; StyleGAN will crash every time with inscrutable convolution/channel-related errors), and there must be no subtle errors like CRC checksum errors which image viewers or libraries like ImageMagick often ignore.

Faces preparation

My workflow:

1. Download raw images from Danbooru2018 if necessary
2. Extract from the JSON Danbooru2018 metadata all the IDs of a subset of images if a specific Danbooru tag (such as a single character) is desired, using jq and shell scripting
3. Crop square anime faces from raw images using Nagadomi’s lbpcascade_animeface (regular face-detection methods do not work on anime images)
4. Delete empty files, monochrome or grayscale files, & exact-duplicate files
5. Convert to JPG
6. Upscale below-target-resolution (512px) images with waifu2x
7. Convert all images to exactly 512×512 resolution sRGB JPG images
8. If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN’s Discriminator
9. Convert to StyleGAN format using dataset_tool.py

The goal is to turn this:

100 random sample images from the 512px SFW subset of Danbooru in a 10×10 grid

into this:

36 random sample images from the cropped Danbooru faces in a 6×6 grid

Below I use shell scripting to prepare the dataset. A possible alternative is danbooru-utility, which aims to help “explore the dataset, filter by tags, rating, and score, detect faces, and resize the images”.

Cropping

The Danbooru2018 download can be done via BitTorrent or rsync, which provides a JSON metadata tarball which unpacks into metadata/2* & a folder structure of {original,512px}/{0-999}/$ID.{png,jpg,...}. For training on SFW whole images, the 512px/ version of Danbooru2018 would work, but it is not a great idea for faces, because by scaling images down to 512px, a lot of face detail has been lost, and getting high-quality faces is a challenge.

The SFW IDs can be extracted from the filenames in 512px/ directly, or from the metadata by extracting the id & rating fields (and saving to a file):

```bash
find ./512px/ -type f | sed -e 's/.*\/\([[:digit:]]*\)\.jpg/\1/'
# 967769
# 1853769
# 2729769
# 704769
# 1799769
# ...

tar xf metadata.json.tar.xz
cat metadata/20180000000000* | jq '[.id, .rating]' -c | fgrep '"s"' | cut -d '"' -f 2
# ...
```

After installing and testing Nagadomi’s lbpcascade_animeface to make sure it & OpenCV work, one can use a simple script which crops the face(s) from a single input image.
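Before batch-cropping, it is worth a quick sanity check that OpenCV can actually load the cascade file; a minimal sketch (the path is an example and should point at wherever lbpcascade_animeface.xml was downloaded):

```python
# Sanity check that OpenCV & the animeface cascade are usable (the .xml path is an example):
import cv2

cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")
assert not cascade.empty(), "cascade failed to load -- check the path & the OpenCV install"
print("lbpcascade_animeface loaded OK")
```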
The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine, but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors—non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250×250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative, more-difficult-to-run library by Nagadomi which offers a face-cropping script, animeface-2009’s face_collector.rb, which Nagadomi says is better at cropping faces, but I was not impressed when I tried it out.)

crop.py:

```python
import cv2
import sys
import os.path

def detect(cascade_file, filename, outputname):
    if not os.path.isfile(cascade_file):
        raise RuntimeError("%s: not found" % cascade_file)

    cascade = cv2.CascadeClassifier(cascade_file)
    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    ## NOTE: Suggested modification: increase minSize to '(250,250)' px,
    ## increasing the proportion of high-quality faces & reducing
    ## false positives. Faces which are only 50×50px are useless
    ## and often not faces at all.
    ## For my StyleGANs, I use 250 or 300px boxes.
    faces = cascade.detectMultiScale(gray,
                                     # detector options
                                     scaleFactor = 1.1,
                                     minNeighbors = 5,
                                     minSize = (50, 50))
    i = 0
    for (x, y, w, h) in faces:
        cropped = image[y: y + h, x: x + w]
        cv2.imwrite(outputname + str(i) + ".png", cropped)
        i = i + 1

if len(sys.argv) != 4:
    sys.stderr.write("usage: detect.py <animeface.xml file> <input> <output prefix>\n")
    sys.exit(-1)

detect(sys.argv[1], sys.argv[2], sys.argv[3])
```

The IDs can be combined with the provided lbpcascade_animeface script using xargs; however, this will be far too slow, and it would be better to exploit parallelism with xargs --max-args=1 --max-procs=16 or parallel. It’s also worth noting that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no apparent speedup (a slowdown if anything, given limited VRAM), so I find it helps to explicitly disable GPU use by setting CUDA_VISIBLE_DEVICES="". (For this step, it’s quite helpful to have a many-core system like a Threadripper.) Combining everything, parallel face-cropping of an entire Danbooru2018 subset can be done like this:

```bash
cropFaces() {
    BUCKET=$(printf "%04d" $(( $@ % 1000 )) )
    ID="$@"
    CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/crop.py \
        ~/src/lbpcascade_animeface/lbpcascade_animeface.xml \
        ./original/$BUCKET/$ID.* "./faces/$ID"
}
export -f cropFaces

mkdir ./faces/
cat sfw-ids.txt | parallel --progress cropFaces

# NOTE: because of the possibility of multiple crops from an image, the script appends an N counter;
# remove that to get back the original ID & filepath: eg
## original/0196/933196.jpg  → portrait/9331961.jpg
## original/0669/1712669.png → portrait/17126690.jpg
## original/0997/3093997.jpg → portrait/30939970.jpg
```

Nvidia StyleGAN, by default and like most image-related tools, expects square images like 512×512px, but there is nothing inherent to neural nets or convolutions that requires square inputs or outputs, and rectangular convolutions are possible. In the case of faces, they tend to be more rectangular than square, and we’d prefer to use a rectangular convolution if possible to focus the image on the relevant dimension rather than either pay the severe performance penalty of increasing total dimensions to 1024×1024px or stick with 512×512px & waste image outputs on emitting black bars/backgrounds. A properly-sized rectangular convolution can offer a nice speedup (eg Fast.ai’s training ImageNet in 18m for $40 using them among other tricks). Nolan Kent’s StyleGAN re-implementation (released October 2019) does support rectangular convolutions, and as he demonstrates in his blog post, it works nicely.

Cleaning & Upscaling

Miscellaneous cleanups can be done:

```bash
## Delete failed/empty files:
find faces/ -size 0 -type f -delete
## Delete 'too small' files, which is indicative of low quality:
find faces/ -size -40k -type f -delete
## Delete exact duplicates:
fdupes --delete --omitfirst --noprompt faces/
## Delete monochrome or minimally-colored images:
### the heuristic of <257 unique colors is imperfect but better than anything else I tried
deleteBW() { if [[ `identify -format "%k" "$@"` -lt 257 ]]; then rm "$@"; fi; }
export -f deleteBW
find faces -type f | parallel --progress deleteBW
```

I remove black-white or grayscale images from all my GAN experiments because in my earliest experiments, their inclusion appeared to increase instability: mixed datasets were extremely unstable, monochrome datasets failed to learn at all, but color-only runs made some progress. It is likely that StyleGAN is now powerful enough to be able to learn on mixed datasets (and some later experiments by other people suggest that StyleGAN can handle both monochrome & color anime-style faces without a problem), but I have not risked a full month-long run to investigate, and so I continue doing color-only.
Discriminator ranking

A good trick with GANs is, after training to reasonable levels of quality, reusing the Discriminator to rank the real datapoints; images the trained D assigns the lowest probability/score of being real are often the worst-quality ones, and going through the bottom decile (or deleting it entirely) should remove many anomalies and may improve the GAN. The GAN is then trained on the new cleaned dataset, making this a kind of “active learning”. Since rating images is what the D already does, no new algorithms or training methods are necessary, and almost no code is necessary: run the D on the whole dataset to rank each image (faster than it seems, since the G & backpropagation are unnecessary—even a large dataset can be ranked in a wallclock hour or two), then one can review manually the bottom & top X%, or perhaps just delete the bottom X% sight unseen if enough data is available.

What is a D doing? I find that the highest-ranked images often contain many anomalies or low-quality images which need to be deleted. Why? The BigGAN paper notes that a well-trained D which achieves 98% real-vs-fake classification performance on the ImageNet training dataset falls to 50–55% accuracy when run on the validation dataset, suggesting the D’s role is more about memorizing the training data than computing some measure of ‘realism’. Perhaps this is because the D ranking is not necessarily a ‘quality’ score but simply a sort of confidence rating that an image is from the real dataset; if the real images contain certain easily-detectable images which the G can’t replicate, then the D might memorize or learn them quickly. For example, in face crops, whole-figure crops are common mistaken crops, making up a tiny percentage of images; how could a face-only G learn to generate whole realistic bodies without the intermediate steps being instantly detected & defeated as errors by D, while D is easily able to detect realistic bodies as definitely real? This would explain the polarized rankings. And given the close connections between GANs & DRL, I have to wonder if there is more memorization going on than suspected in things like “Deep reinforcement learning from human preferences”? Incidentally, this may also explain the problem with using Discriminators for semi-supervised representation learning: if the D is memorizing datapoints to force the G to generalize, then its internal representations would be expected to be useless. (One would instead want to extract knowledge from the G, perhaps by encoding an image into z and using the z as the representation.)

An alternative perspective is offered by a crop of 2020 papers (Zhao et al 2020b; Tran et al 2020; Karras et al 2020; Zhao et al 2020c) examining data augmentation for GANs, which find that for GAN data augmentation to be useful, it must be done during training, and one must augment all images. Zhao et al 2020c & Karras et al 2020 observe, with regular GAN training, a striking steady decline of D performance on heldout data, and an increase on training data, throughout the course of training, confirming the BigGAN observation but also showing it is a dynamic phenomenon, and probably a bad one. Adding in correct data augmentation reduces this overfitting—and markedly improves sample-efficiency & final quality. This suggests that the D does indeed memorize, but that this is not a good thing. Karras et al 2020 describe what happens: “Convergence is now achieved [with ADA/data augmentation] regardless of the training set size and overfitting no longer occurs.
Without augmentations, the gradients the generator receives from the discriminator become very simplistic over time—the discriminator starts to pay attention to only a handful of features, and the generator is free to create otherwise nonsensical images. With ADA, the gradient field stays much more detailed which prevents such deterioration.”

In other words, just as the G can ‘mode collapse’ by focusing on generating images with only a few features, the D can also ‘feature collapse’ by focusing on a few features which happen to correctly split the training data’s reals from fakes, such as by memorizing them outright. This technically works, but not well. This also explains why BigGAN training stabilized when training on JFT-300M: divergence/collapse usually starts with D winning; if D wins because it memorizes, then a sufficiently large dataset should make memorization infeasible; and JFT-300M turns out to be sufficiently large. (This would predict that if Brock et al had checked the JFT-300M BigGAN D’s classification performance on a held-out JFT-300M sample, rather than just on their ImageNet BigGAN, they would have found that it classified reals vs fakes well above chance.) If so, this suggests that for D ranking, it may not be too useful to take the D from the end of a run, if not using data augmentation, because that D may be the version with the greatest degree of memorization!

Here is a simple StyleGAN2 script (ranker.py) to open a StyleGAN .pkl and run it on a list of image filenames to print out the D score, courtesy of Shao Xuning:

```python
import pickle
import numpy as np
import cv2
import dnnlib.tflib as tflib
import random
import argparse
import PIL.Image
from training.misc import adjust_dynamic_range

def preprocess(file_path):
    # print(file_path)
    img = np.asarray(PIL.Image.open(file_path))
    # Preprocessing from dataset_tool.create_from_images
    img = img.transpose([2, 0, 1])  # HWC => CHW
    # img = np.expand_dims(img, axis=0)
    img = img.reshape((1, 3, 512, 512))
    # Preprocessing from training_loop.process_reals
    img = adjust_dynamic_range(data=img, drange_in=[0, 255], drange_out=[-1.0, 1.0])
    return img

def main(args):
    random.seed(args.random_seed)
    minibatch_size = args.minibatch_size
    input_shape = (minibatch_size, 3, 512, 512)
    # print(args.images)
    images = args.images
    images.sort()

    tflib.init_tf()
    _G, D, _Gs = pickle.load(open(args.model, "rb"))
    # D.print_layers()

    image_score_all = [(image, []) for image in images]

    # Shuffle the images and process each image in multiple minibatches.
    # Note: networks.stylegan2.minibatch_stddev_layer
    # calculates the standard deviation of a minibatch group as a feature channel,
    # which means that the output of the discriminator actually depends
    # on the companion images in the same minibatch.
    for i_shuffle in range(args.num_shuffles):
        # print('shuffle: {}'.format(i_shuffle))
        random.shuffle(image_score_all)
        for idx_1st_img in range(0, len(image_score_all), minibatch_size):
            idx_img_minibatch = []
            images_minibatch = []
            input_minibatch = np.zeros(input_shape)
            for i in range(minibatch_size):
                idx_img = (idx_1st_img + i) % len(image_score_all)
                idx_img_minibatch.append(idx_img)
                image = image_score_all[idx_img][0]
                images_minibatch.append(image)
                img = preprocess(image)
                input_minibatch[i, :] = img
            output = D.run(input_minibatch, None, resolution=512)
            print('shuffle: {}, indices: {}, images: {}'.format(i_shuffle, idx_img_minibatch, images_minibatch))
            print('Output: {}'.format(output))
            for i in range(minibatch_size):
                idx_img = idx_img_minibatch[i]
                image_score_all[idx_img][1].append(output[i][0])

    with open(args.output, 'a') as fout:
        for image, score_list in image_score_all:
            print('Image: {}, score_list: {}'.format(image, score_list))
            avg_score = sum(score_list) / len(score_list)
            fout.write(image + ' ' + str(avg_score) + '\n')

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, required=True, help='.pkl model')
    parser.add_argument('--images', nargs='+')
    parser.add_argument('--output', type=str, default='rank.txt')
    parser.add_argument('--minibatch_size', type=int, default=4)
    parser.add_argument('--num_shuffles', type=int, default=5)
    parser.add_argument('--random_seed', type=int, default=0)
    return parser.parse_args()

if __name__ == '__main__':
    main(parse_arguments())
```
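For reference, an invocation would look something like `python ranker.py --model results/.../network-snapshot-xxxxxx.pkl --images faces/*.jpg` (per the argparse flags above). Once scores have been written to rank.txt (one ‘filename score’ line per image, as the script does), the lowest-scoring fraction can be listed or deleted; a minimal sketch, where the helper name and the 10% cutoff are illustrative only:

```python
# prune_bottom.py (illustrative sketch, not part of the StyleGAN repo):
# read the 'filename score' lines written by ranker.py and list/delete the lowest-scoring X%.
# Review a sample of what it prints before trusting any automatic deletion.
import os, sys

def prune(rank_file="rank.txt", fraction=0.10, dry_run=True):
    scores = []
    with open(rank_file) as f:
        for line in f:
            filename, score = line.rsplit(' ', 1)   # the score was appended after a space
            scores.append((float(score), filename))
    scores.sort()                                   # lowest D score (least 'real') first
    cutoff = int(len(scores) * fraction)
    for score, filename in scores[:cutoff]:
        print(f"{score:.3f}\t{filename}")
        if not dry_run and os.path.exists(filename):
            os.remove(filename)

if __name__ == "__main__":
    prune(dry_run="--delete" not in sys.argv)
```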

Depending on how noisy the rankings are in terms of ‘quality’ and available sample size, one can either review the worst-ranked images by hand, or delete the bottom X%. One should check the top-ranked images as well to make sure the ordering is right; there can also be some odd images in the top X% which should be removed. It might be possible to use ranker.py to improve the quality of generated samples as well, as a simple version of discriminator rejection sampling.

Upscaling

The next major step is upscaling images using waifu2x, which does an excellent job on 2× upscaling of anime images—the results are nigh-indistinguishable from a higher-resolution original—and greatly increases the usable corpus. The downside is that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2×1080ti), and is written in a now-unmaintained DL framework, Torch, with no current plans to port to PyTorch, and is gradually becoming harder to get running (one hopes that by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, and waifu2x is worthwhile.

```bash
. ~/src/torch/install/bin/torch-activate

upscaleWaifu2x() {
    SIZE1=$(identify -format "%h" "$@")
    SIZE2=$(identify -format "%w" "$@")

    if (( $SIZE1 < 512 && $SIZE2 < 512 )); then
        echo "$@" "$SIZE1"x"$SIZE2"
        TMP=$(mktemp "/tmp/XXXXXX.png")
        CUDA_VISIBLE_DEVICES="$(( RANDOM % 2 < 1 ))" nice th ~/src/waifu2x/waifu2x.lua -model_dir \
            ~/src/waifu2x/models/upconv_7/art -tta 1 -m scale -scale 2 \
            -i "$@" -o "$TMP"
        convert "$TMP" "$@"
        rm "$TMP"
    fi
}
export -f upscaleWaifu2x

find faces/ -type f | parallel --progress --jobs 9 upscaleWaifu2x
```

Quality Checks & Data Augmentation

The single most effective strategy to improve a GAN is to clean the data. StyleGAN cannot handle too-diverse datasets composed of multiple objects or single objects shifted around, and rare or odd images cannot be learned well. Karras et al get such good results with StyleGAN on faces in part because they constructed FFHQ to be an extremely clean, consistent dataset of just centered, well-lit, clear human faces without any obstructions or other variation. Similarly, Arfa’s “This Fursona Does Not Exist” (TFDNE) S2 generates much better furry portraits than my own “This Waifu Does Not Exist” (TWDNE) S2 anime portraits, due partly to training longer to convergence on a TPU pod but mostly due to his investment in data cleaning: aligning the faces and heavy filtering of samples—this left him with only n=50k, but TFDNE nevertheless outperforms TWDNE’s n=300k.
(Data cleaning/augmentation is one of the more powerful ways to improve results; if we imagine deep learning as ‘programming’ or ‘Software 2.0’ in Andrej Karpathy’s terms, data cleaning/augmentation is one of the easiest ways to finetune the loss function towards what we really want, by gardening our data to remove what we don’t want and increase what we do.)

At this point, one can do manual quality checks by viewing a few hundred images, running findimagedupes -t 99% to look for near-identical faces, or dabble in further modifications such as doing “data augmentation”. Working with Danbooru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN, and one will have difficulty storing the final StyleGAN dataset because of its sheer size (due to the ~18× size multiplier). After cleaning etc, my final face dataset is the portrait dataset with n=300k.

However, if that is not enough, or one is working with a small dataset like for a single character, data augmentation may be necessary. The mirror/horizontal flip is not necessary, as StyleGAN has that built in as an option, but there are many other possible data augmentations: one can stretch, shift colors, sharpen, blur, increase/decrease contrast/brightness, crop, and so on. An example, extremely aggressive, set of data augmentations could be done like this:

```bash
dataAugment () {
    image="$@"
    target=$(basename "$@")
    suffix="png"
    convert -deskew 50                   "$image" "$target".deskew."$suffix"
    convert -resize 110%x100%            "$image" "$target".horizstretch."$suffix"
    convert -resize 100%x110%            "$image" "$target".vertstretch."$suffix"
    convert -blue-shift 1.1              "$image" "$target".midnight."$suffix"
    convert -fill red    -colorize 5%    "$image" "$target".red."$suffix"
    convert -fill orange -colorize 5%    "$image" "$target".orange."$suffix"
    convert -fill yellow -colorize 5%    "$image" "$target".yellow."$suffix"
    convert -fill green  -colorize 5%    "$image" "$target".green."$suffix"
    convert -fill blue   -colorize 5%    "$image" "$target".blue."$suffix"
    convert -fill purple -colorize 5%    "$image" "$target".purple."$suffix"
    convert -adaptive-blur 3x2           "$image" "$target".blur."$suffix"
    convert -adaptive-sharpen 4x2        "$image" "$target".sharpen."$suffix"
    convert -brightness-contrast 10      "$image" "$target".brighter."$suffix"
    convert -brightness-contrast 10x10   "$image" "$target".brightercontraster."$suffix"
    convert -brightness-contrast -10     "$image" "$target".darker."$suffix"
    convert -brightness-contrast -10x10  "$image" "$target".darkerlesscontrast."$suffix"
    convert +level 5%                    "$image" "$target".contraster."$suffix"
    convert -level 5%\!                  "$image" "$target".lesscontrast."$suffix"
}
export -f dataAugment
find faces/ -type f | parallel --progress dataAugment
```

Upscaling & Conversion

Once any quality fixes or data augmentation are done, it’d be a good idea to save a lot of disk space by converting to JPG & lossily reducing quality (I find quality 33 saves a ton of space at no visible change):

```bash
convertPNGToJPG() { convert -quality 33 "$@" "$@".jpg && rm "$@"; }
export -f convertPNGToJPG
find faces/ -type f -name "*.png" | parallel --progress convertPNGToJPG
```

Remember that StyleGAN models are only compatible with images of the type they were trained on, so if you are using a StyleGAN pretrained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN models), you will need to keep using PNGs.
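(Relatedly, once the final 512×512 rescaling in the next step is done, a quick Pillow-based audit can flag any file which still deviates from the target size, colorspace, or filetype—a sketch only, equivalent in spirit to the identify-based filter shown below; adjust the target format to whatever your dataset actually uses:)

```python
# Print any file which is not a 512x512 RGB JPEG (sketch; dataset_tool.py will choke on deviations).
import sys
from PIL import Image

def nonconforming(path, size=(512, 512), mode="RGB", fmt="JPEG"):
    try:
        with Image.open(path) as im:
            return im.size != size or im.mode != mode or im.format != fmt
    except OSError:          # truncated/corrupt file
        return True

if __name__ == "__main__":
    for path in sys.argv[1:]:
        if nonconforming(path):
            print(path)      # candidate for fixing or deletion
```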
Doing the final scaling to exactly 512px can be done at many points, but I generally postpone it to the end in order to work with images in their ‘native’ resolutions & aspect-ratios for as long as possible. At this point we carefully tell ImageMagick to rescale everything to 512×512, not preserving the aspect ratio, filling in with a black background as necessary on either side:

```bash
find faces/ -type f | xargs --max-procs=16 -n 9000 \
    mogrify -resize 512x512\> -extent 512x512\> -gravity center -background black
```

Any slightly-different image could crash the import process. Therefore, we delete any image which is even slightly different from the 512×512 sRGB JPG it is supposed to be:

```bash
## remember the warning: images must be identical, square, and sRGB/grayscale:
find faces/ -type f | xargs --max-procs=16 -n 9000 identify | \
    fgrep -v " JPEG 512x512 512x512+0+0 8-bit sRGB" | cut -d ' ' -f 1 | \
    xargs --max-procs=16 -n 10000 rm
```

Having done all this, we should have a large, consistent, high-quality dataset. Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format using dataset_tool.py. It is worth remembering at this point how fragile dataset_tool.py is and what its requirements are; ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem. Because of the extreme fragility of dataset_tool.py, I strongly advise that you edit it to print out the filename of each file as it is being processed, so that when (not if) it crashes, you can investigate the culprit and check the rest. The edit could be as simple as this:

```diff
diff --git a/dataset_tool.py b/dataset_tool.py
index 4ddfe44..e64e40b 100755
--- a/dataset_tool.py
+++ b/dataset_tool.py
@@ -519,6 +519,7 @@ def create_from_images(tfrecord_dir, image_dir, shuffle):
     with TFRecordExporter(tfrecord_dir, len(image_filenames)) as tfr:
         order = tfr.choose_shuffled_order() if shuffle else np.arange(len(image_filenames))
         for idx in range(order.size):
+            print(image_filenames[order[idx]])
             img = np.asarray(PIL.Image.open(image_filenames[order[idx]]))
             if channels == 1:
                 img = img[np.newaxis, :, :] # HW => CHW
```

There should be no issues if all the images were thoroughly checked earlier, but should any images crash it, they can be checked in more detail with identify. (I advise just deleting them and not trying to rescue them.) Then the conversion is just (assuming the StyleGAN prerequisites are installed, see the next section):

```bash
source activate MY_TENSORFLOW_ENVIRONMENT
python dataset_tool.py create_from_images datasets/faces /media/gwern/Data/danbooru2018/faces/
```

Congratulations, the hardest part is over. Most of the rest simply requires patience (and a willingness to edit Python files directly in order to configure StyleGAN).

Training

Installation

I assume you have CUDA installed & functioning. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have successfully used the Nvidia driver version #410.104, CUDA 10.1, and TensorFlow 1.13.1.) A Python ≥3.6 virtual environment can be set up for StyleGAN to keep dependencies tidy, and TensorFlow & the StyleGAN dependencies installed:

```bash
conda create -n stylegan pip python=3.6
source activate stylegan
## TF:
pip install tensorflow-gpu
## Test install:
python -c "import tensorflow as tf; tf.enable_eager_execution(); \
           print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
pip install tensorboard

## StyleGAN:
## Install pre-requisites:
pip install pillow numpy moviepy scipy opencv-python lmdb # requests?
## Download:
git clone 'https://github.com/NVlabs/stylegan.git' && cd ./stylegan/
## Test install:
python pretrained_example.py
## ./results/example.png should be a photograph of a middle-aged man
```

StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks, using this Colab notebook. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little or who are doing something quick like extremely low-resolution training or transfer-learning where a few GPU-hours on a slow small GPU might be enough.

Configuration

StyleGAN doesn’t ship with any support for CLI options; instead, one must edit train.py and train/training_loop.py:

train/training_loop.py

The core configuration is done in the argument defaults of the training_loop function beginning on line 112. The key arguments are G_smoothing_kimg & D_repeats (affects the learning dynamics), network_snapshot_ticks (how often to save the pickle snapshots—more frequent means less progress lost in crashes, but as each one weighs 300MB+, they can quickly use up gigabytes of space), resume_run_id (set to "latest"), and resume_kimg.

resume_kimg governs where in the overall progressive-growing training schedule StyleGAN starts from. If it is set to 0, training begins at the beginning of the progressive-growing schedule, at the lowest resolution, regardless of how much training has been previously done. It is vitally important when doing transfer learning that it is set to a sufficiently high number (eg 10000) that training begins at the highest desired resolution like 512px, as it appears that layers are erased when added during progressive growing. (resume_kimg may also need to be set to a high value to make it skip straight to training at the highest resolution if you are training on small datasets of small images, where there is a risk of it overfitting under the normal training schedule and never reaching the highest resolution.) This trick is unnecessary in StyleGAN 2, which is simpler in not using progressive growing.

More experimentally, I suggest setting minibatch_repeats = 1 instead of minibatch_repeats = 5; in line with the suspiciousness of the gradient-accumulation implementation in ProGAN/StyleGAN, this appears to make training both stabler & faster.

Note that some of these variables, like learning rates, are overridden in train.py. It’s better to set those there, or else you may confuse yourself badly (like I did in wondering why ProGAN & StyleGAN seemed extraordinarily robust to large changes in the learning rates…).
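To summarize the non-default values suggested above as a sketch (these are illustrative overrides for training_loop()’s keyword defaults, not the upstream values):

```python
# Illustrative overrides for training_loop()'s keyword defaults in train/training_loop.py;
# the values follow the discussion above and are suggestions, not NVlabs defaults.
TRAINING_LOOP_OVERRIDES = dict(
    minibatch_repeats = 1,         # default is 5; 1 appeared both stabler & faster in my runs
    resume_run_id     = "latest",  # needs the 'latest' patch shown under 'Running' below
    resume_kimg       = 10000.0,   # start at the top resolution, eg when transfer learning at 512px
)
```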
train.py (previously config.py in ProGAN; renamed run_training.py in StyleGAN 2)

Here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. (This file includes settings intended for ProGAN—watch out that you don’t accidentally turn on ProGAN instead of StyleGAN & confuse yourself.) Learning rate & minibatch should generally be left alone (except towards the end of training, when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, like thus:

```python
desc += '-faces'; dataset = EasyDict(tfrecord_dir='faces', resolution=512); train.mirror_augment = True
```

This sets up the 512px face dataset which was previously created in datasets/faces, turns on mirroring (because while there may be writing in the background, we don’t care about it for face generation), and sets a title for the checkpoints/logs, which will now appear in results/ with the ‘-faces’ string.

Assuming you do not have 8 GPUs (as you probably do not), you must change the preset to match your number of GPUs; StyleGAN will not automatically choose the correct number of GPUs. If you fail to set it correctly to the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with this opaque error message (note that CUDA uses zero-indexing, so GPU:0 refers to the first GPU, GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my—nonexistent—third GPU):

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
    G_synthesis_3/lod: {{node G_synthesis_3/lod}}was explicitly assigned to /device:GPU:2 but available \
    devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
    /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
    /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
    Make sure the device specification refers to a valid device. [[{{node G_synthesis_3/lod}}]]
```

For my 2×1080ti I’d set:

```python
desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8
sched.minibatch_dict = {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}
sched.G_lrate_dict = {512: 0.0015, 1024: 0.002}
sched.D_lrate_dict = EasyDict(sched.G_lrate_dict)
train.total_kimg = 99000
```

So my results get saved to results/00001-sgan-faces-2gpu etc (the run ID increments, ‘sgan’ because StyleGAN rather than ProGAN, ‘-faces’ as the dataset being trained on, and ‘2gpu’ because it’s a multi-GPU run).

Running

I typically run StyleGAN in a screen session, which can be detached and keeps multiple shells organized: 1 terminal/shell for the StyleGAN run, 1 terminal/shell for TensorBoard, and 1 for Emacs. With Emacs, I keep the two key Python files open (train.py and train/training_loop.py) for reference & easy editing.
With the “latest” patch, StyleGAN can be thrown into a while-loop to keep running after crashes, like:

```bash
while true; do nice py train.py; date; (xmessage "alert: StyleGAN crashed" &); sleep 10s; done
```

TensorBoard is a logging utility which displays little time-series of recorded variables which one views in a web browser, eg:

```bash
tensorboard --logdir results/02022-sgan-faces-2gpu/
# TensorBoard 1.13.0 at http://127.0.0.1:6006 (Press CTRL+C to quit)
```

Note that TensorBoard can be backgrounded, but needs to be updated every time a new run is started, as the results will then be in a different folder.

Training StyleGAN is much easier & more reliable than other GANs, but it is still more of an art than a science. (We put up with it because while GANs suck, everything else sucks more.) Notes on training:

Crashproofing: The initial release of StyleGAN was prone to crashing when I ran it, segfaulting at random. Updating TensorFlow appeared to reduce this, but the root cause is still unknown. Segfaulting or crashing is also reportedly common if running on mixed GPUs (eg a 1080ti + Titan V). Unfortunately, StyleGAN has no setting for simply resuming from the latest snapshot after crashing/exiting (which is what one usually wants), and one must manually edit the resume_run_id line in training_loop.py to set it to the latest run ID. This is tedious and error-prone—at one point I realized I had wasted 6 GPU-days of training by restarting from a 3-day-old snapshot because I had not updated the resume_run_id after a segfault! If you are doing any runs longer than a few wallclock hours, I strongly advise use of nshepperd’s patch to automatically restart from the latest snapshot by setting resume_run_id = "latest":

```diff
diff --git a/training/misc.py b/training/misc.py
index 50ae51c..d906a2d 100755
--- a/training/misc.py
+++ b/training/misc.py
@@ -119,6 +119,14 @@ def list_network_pkls(run_id_or_run_dir, include_final=True):
         del pkls[0]
     return pkls

+def locate_latest_pkl():
+    allpickles = sorted(glob.glob(os.path.join(config.result_dir, '0*', 'network-*.pkl')))
+    latest_pickle = allpickles[-1]
+    resume_run_id = os.path.basename(os.path.dirname(latest_pickle))
+    RE_KIMG = re.compile('network-snapshot-(\d+).pkl')
+    kimg = int(RE_KIMG.match(os.path.basename(latest_pickle)).group(1))
+    return (locate_network_pkl(resume_run_id), float(kimg))
+
 def locate_network_pkl(run_id_or_run_dir_or_network_pkl, snapshot_or_network_pkl=None):
     for candidate in [snapshot_or_network_pkl, run_id_or_run_dir_or_network_pkl]:
         if isinstance(candidate, str):
diff --git a/training/training_loop.py b/training/training_loop.py
index 78d6fe1..20966d9 100755
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -148,7 +148,10 @@ def training_loop(
     # Construct networks.
     with tf.device('/gpu:0'):
         if resume_run_id is not None:
-            network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
+            if resume_run_id == 'latest':
+                network_pkl, resume_kimg = misc.locate_latest_pkl()
+            else:
+                network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
             print('Loading networks from "%s"...' % network_pkl)
             G, D, Gs = misc.load_pkl(network_pkl)
         else:
```

(The diff can be edited by hand, or copied into the repo as a file like latest.patch & then applied with git apply latest.patch.)

Tuning Learning Rates: The LR is one of the most critical hyperparameters: too-large updates based on too-small minibatches are devastating to GAN stability & final quality. The LR also seems to interact with the intrinsic difficulty or diversity of an image domain; Karras et al 2019 use 0.003 G/D LRs on their FFHQ dataset (which has been carefully curated, and the faces aligned to put landmarks like eyes/mouth in the same locations in every image) when training on 8-GPU machines with minibatches of n=32, but I find lower to be better on my anime face/portrait datasets, where I can only do n=8. From looking at training videos of whole-Danbooru2018 StyleGAN runs, I suspect that the necessary LRs would be lower still. Learning rates are closely related to minibatch size (a common rule of thumb in supervised learning of CNNs is that the biggest usable LR follows a square-root curve in minibatch size), and the BigGAN research argues that minibatch size itself strongly influences how bad mode dropping is, which suggests that smaller LRs may be more necessary the more diverse/difficult a dataset is.
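As a concrete application of that square-root rule of thumb (only a heuristic starting point, not a StyleGAN feature): scaling the FFHQ reference LR down from minibatches of n=32 to n=8 gives roughly the 0.0015 used for 512px in the train.py settings earlier:

```python
import math

# Square-root LR scaling heuristic:  lr_new = lr_ref * sqrt(n_new / n_ref)
lr_ref, n_ref = 0.003, 32            # Karras et al 2019's FFHQ settings (8 GPUs, minibatch n=32)
n_new = 8                            # eg a 2-GPU 512px run
print(lr_ref * math.sqrt(n_new / n_ref))   # => 0.0015
```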

Balancing G/D:

Screenshot of TensorBoard G/D losses for an anime face StyleGAN making progress towards convergence

Later in training, if the G is not making good progress towards the ultimate goal of a 0.5 loss (with the D’s loss gradually decreasing towards 0.5), and has a loss stubbornly stuck around −1 or so, it may be necessary to change the balance of G/D. This can be done several ways, but the easiest is to adjust the LRs in train.py: sched.G_lrate_dict & sched.D_lrate_dict. One needs to keep an eye on the G/D losses and also the perceptual quality of the faces (since we don’t have any good FID equivalent yet for anime faces, which would require a good open-source Danbooru tagger to create embeddings), and reduce both LRs (or usually just the D’s LR) based on the face quality and whether the G/D losses are exploding or otherwise look imbalanced. What you want, I think, is for the G/D losses to be stable at a certain absolute amount for a long time while the quality visibly improves, reducing D’s LR as necessary to keep it balanced with G; and then once you’ve run out of time/patience or artifacts are showing up, you can decrease both LRs to converge onto a local optimum. I find the default of 0.003 can be too high once quality reaches a high level with both faces & portraits, and it helps to reduce it to a third (0.001) or a tenth (0.0003). If there still isn’t convergence, the D may be too strong, and it can be turned down separately, to a tenth or even a fiftieth. (Given the stochasticity of training & the relativity of the losses, one should wait several wallclock hours or days after each modification to see if it made a difference.)
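For concreteness, a late-training rebalancing might look something like this in train.py—a sketch following the numbers above, not a recipe:

```python
# Late-training rebalancing sketch (edit the corresponding sched.* fields in train.py):
G_lrate_dict = {512: 0.001}     # G reduced to a third of the 0.003 default to promote convergence
D_lrate_dict = {512: 0.0003}    # D reduced to a tenth; go lower still if D continues to dominate
```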

Skipping FID Metrics: Some metrics are computed for logging/reporting. The FID metrics are calculated using an old ImageNet CNN; what is realistic on ImageNet may have little to do with your particular domain, and while a large FID like 100 is concerning, FIDs like 20—or even increasing FIDs—are not necessarily a problem or useful guidance compared to just looking at the generated samples or the loss curves. Given that computing FID metrics is not free & potentially irrelevant or misleading on many image domains, I suggest disabling them entirely. (They are not used in the training for anything, and disabling them is safe.) They can be edited out of the main training loop by commenting out the call to metrics.run like so:

```diff
@@ -261,7 +265,7 @@ def training_loop()
         if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
             pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
             misc.save_pkl((G, D, Gs), pkl)
-            metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)
+            # metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)
```

‘Blob’ & ‘Crack’ Artifacts: During training, ‘blobs’ often show up or move around. These blobs appear even late in training on otherwise high-quality images and are unique to StyleGAN (at least, I’ve never seen another GAN whose training artifacts look like the blobs). That they are so large & glaring suggests a weakness in StyleGAN somewhere. The source of the blobs was unclear. If you watch training videos, these blobs seem to gradually morph into new features such as eyes or hair or glasses; I suspect they are part of how StyleGAN ‘creates’ new features, starting with a featureless blob superimposed at approximately the right location, and gradually refined into something useful.

The StyleGAN 2 paper investigated the blob artifacts & found them to be due to the Generator working around a flaw in StyleGAN’s use of AdaIN normalization. Karras et al 2019 note that images without a blob somewhere are severely corrupted; because the blobs are in fact doing something useful, it is unsurprising that the Discriminator doesn’t make the Generator get rid of them. StyleGAN 2 changes the AdaIN normalization to eliminate this problem, improving overall quality. If blobs are appearing too often, or one wants a final model without any new intrusive blobs, it may help to lower the LR to try to converge to a local optimum where the necessary blob is hidden away somewhere unobtrusive.

In training anime faces, I have seen additional artifacts, which look like ‘cracks’ or ‘waves’ or elephant-skin wrinkles or the sort of fine crazing seen in old paintings or ceramics, which appear toward the end of training on primarily skin or areas of flat color; they happen particularly fast when transfer learning on a small dataset. The only solution I have found so far is to either stop training or get more data. In contrast to the blob artifacts (identified as an architectural problem & fixed in StyleGAN 2), I currently suspect the cracks are a sign of overfitting rather than a peculiarity of normal StyleGAN training: the G has started trying to memorize noise in the fine detail of pixelation/lines, and so these are a kind of overfitting/mode collapse. (More speculatively: another possible explanation is that the cracks are caused by the StyleGAN D being single-scale rather than multi-scale—as in MSG-GAN and a number of others—and the ‘cracks’ are actually high-frequency noise created by the G in specific patches as adversarial examples to fool the D. They reportedly do not appear in MSG-GAN or StyleGAN 2, which both use multi-scale Ds.)

Gradient Accumulation: ProGAN/StyleGAN’s codebase claims to support gradient accumulation, which is a way to fake large-minibatch training (eg n=2048) by not doing the backpropagation update every minibatch, but instead summing the gradients over many minibatches and applying them all at once. This is a useful trick for stabilizing training, and large-minibatch NN training can differ qualitatively from small-minibatch NN training—BigGAN performance increased with increasingly large minibatches (n=2048), and the authors speculate that this is because such large minibatches mean that the full diversity of the dataset is represented in each ‘minibatch’, so the BigGAN models cannot simply ‘forget’ rarer datapoints which would otherwise not appear for many minibatches in a row, resulting in the GAN pathology of ‘mode dropping’, where some kinds of data just get ignored by both G/D. However, the ProGAN/StyleGAN implementation of gradient accumulation does not resemble that of any other implementation I’ve seen in TensorFlow or PyTorch, and in my own experiments with up to n=4096, I didn’t observe any stabilization or qualitative differences, so I am suspicious the implementation is wrong.

Here is what a successful training progression looks like for the anime face StyleGAN:

Training montage video of the first 9k iterations of the anime face StyleGAN. The anime face model is obsoleted by the StyleGAN 2 portrait model.

The anime face model as of 2019-03-08, trained for 21,980 iterations or ~21m images or ~38 GPU-days, is available for download. (It is still not fully-converged, but the quality is good.)

Sampling

Having successfully trained a StyleGAN, now the fun part—generating samples!

Psi/“Truncation Trick”

The 𝜓/“truncation trick” (BigGAN discussion, StyleGAN discussion; apparently first introduced by Marchesi 2017) is the most important hyperparameter for all StyleGAN generation. The truncation trick is used at sample-generation time but not training time. The idea is to edit the latent vector z, which is a vector of N(0,1) variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those. This seems to help by avoiding ‘extreme’ latent values or combinations of latent values which the G is not as good at—a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted N(0,1) distribution for each one, or one can truncate them at something like +1SD or +0.7SD. (As with the discussion of the best distribution for the original latents, there’s no good reason to think that this is an optimal method of doing truncation; there are many alternatives, such as ones penalizing the sum of the variables, either rejecting them or scaling them down, and some appear to work much better than the current truncation trick.)

At 𝜓=0, diversity is nil and all faces are a single global average face (a brown-eyed brown-haired schoolgirl, unsurprisingly); at ±0.5 you have a broad range of faces, and by ±1.2, you’ll see tremendous diversity in faces/styles/consistency but also tremendous artifacting & distortion. Where you set your 𝜓 will heavily influence how ‘original’ outputs look. At 𝜓=1.2, they are tremendously original but extremely hit-or-miss. At 𝜓=0.5 they are consistent but boring. For most of my sampling, I set 𝜓=0.7, which strikes the best balance between craziness/artifacting and quality/diversity. (Personally, I prefer to look at 𝜓=1.2 samples because they are so much more interesting, but if I released those samples, it would give a misleading impression to readers.)
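(StyleGAN’s built-in truncation_psi works somewhat differently—it shrinks the intermediate latent toward its mean—but the resample-the-outliers idea described above can be sketched in a few lines of numpy; this is an illustration of the concept, not the repo’s implementation:)

```python
import numpy as np

def resample_truncate(z, psi=0.7, rng=None):
    """Resample any latent entries whose magnitude exceeds psi, as described above
    (a z-space sketch of the idea; StyleGAN's own truncation instead interpolates
    the intermediate latent toward its mean by the factor psi)."""
    rng = rng or np.random.default_rng()
    z = np.array(z, copy=True)
    mask = np.abs(z) > psi
    while mask.any():                        # resample outliers until all entries lie within ±psi
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > psi
    return z

z = np.random.default_rng(0).standard_normal(512)    # a raw latent vector
print(np.abs(resample_truncate(z, psi=0.7)).max())   # all entries now within ±0.7
```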
Random Samples

The StyleGAN repo has a simple script, pretrained_example.py, to download & generate a single face; in the interests of reproducibility, it hardwires the model and the RNG seed so it will only generate 1 particular face. However, it can be easily adapted to use a local model and (slowly) generate, say, 1000 sample images with the hyperparameter 𝜓=0.6 (which gives high-quality but not highly-diverse images), saved to results/example-{0-999}.png:

```python
import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    Gs.print_layers()
    for i in range(0, 1000):
        # A fresh random latent per sample (seeded from the OS, so not reproducible):
        rnd = np.random.RandomState(None)
        latents = rnd.randn(1, Gs.input_shape[1])
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.6, randomize_noise=True, output_transform=fmt)
        os.makedirs(config.result_dir, exist_ok=True)
        png_filename = os.path.join(config.result_dir, 'example-' + str(i) + '.png')
        PIL.Image.fromarray(images[0], 'RGB').save(png_filename)

if __name__ == "__main__":
    main()
```

Karras et al 2018 Figures

The figures in Karras et al 2018, demonstrating random samples and aspects of the style noise using the 1024px FFHQ face model (as well as the others), were generated by generate_figures.py. This script needs extensive modifications to work with my 512px anime faces; going through the file:

- the code uses 𝜓=1 truncation, but faces look better with 𝜓=0.7 (several of the functions have truncation_psi= settings but, trickily, Figure 3's draw_style_mixing_figure has its 𝜓 setting hidden away in the synthesis_kwargs global variable)

- the loaded model needs to be switched to the anime face model, of course

- dimensions must be reduced 1024→512 as appropriate; some ranges are hardcoded and must be reduced for 512px images as well

- the truncation-trick Figure 8 doesn't show enough faces to give insight into what the latent space is doing, so it needs to be expanded to show both more random seeds/faces and more 𝜓 values

- the bedroom/car/cat samples should be disabled

The changes I make are as follows:

```diff
diff --git a/generate_figures.py b/generate_figures.py
index 45b68b8..f27af9d 100755
--- a/generate_figures.py
+++ b/generate_figures.py
@@ -24,16 +24,13 @@
 url_bedrooms = 'https://drive.google.com/uc?id=1MOSKeGF0FJcivpBI7s63V9YHloUTO
 url_cars = 'https://drive.google.com/uc?id=1MJ6iCfNtMIRicihwRorsM3b7mmtmK9c3' # karras2019stylegan-cars-512x384.pkl
 url_cats = 'https://drive.google.com/uc?id=1MQywl0FNt6lHu8E_EUqnRbviagS7fbiJ' # karras2019stylegan-cats-256x256.pkl

-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)

 _Gs_cache = dict()

 def load_Gs(url):
-    if url not in _Gs_cache:
-        with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
-            _G, _D, Gs = pickle.load(f)
-        _Gs_cache[url] = Gs
-    return _Gs_cache[url]
+    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+    return Gs

 #----------------------------------------------------------------------------
 # Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
     canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
     for row, seed in enumerate(seeds):
         latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
-        images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+        images = Gs.run(latents, None, **synthesis_kwargs)
         canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
         for i in range(4):
             crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
     all_images = []
     for noise_range in noise_ranges:
         tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
-        range_images = Gsc.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+        range_images = Gsc.run(latents, None, randomize_noise=False, **synthesis_kwargs)
         range_images[flips, :, :] = range_images[flips, :, ::-1]
         all_images.append(list(range_images))
@@ -144,14 +141,11 @@ def main():
     tflib.init_tf()
     os.makedirs(config.result_dir, exist_ok=True)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
-    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
-    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
-    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
-    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])
```

All this done, we get some fun anime face samples to parallel Karras et al 2018's figures:

Anime face StyleGAN, Figure 2: uncurated samples

Figure 3: "style mixing" of source/transfer faces, demonstrating control & interpolation (top row = style, left column = target to be styled)

Figure 8: the "truncation trick" visualized—10 random faces across the range 𝜓 = [1, 0.7, 0.5, 0.25, 0, −0.25, −0.5, −1], demonstrating the tradeoff between diversity & quality, and the global average face

Videos

Training Montage

The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated; at the end they can be quite large (often 14MB each for the anime faces), so lossy compression with a tool like pngnq + advpng, or converting them to JPGs at lowered quality, is a good idea. To turn the many snapshots into a training montage video like the one above, I use FFmpeg on the PNGs:

```bash
cat $(ls ./results/*faces*/fakes*.png | sort --numeric-sort) | ffmpeg \
    -framerate 10 \            # show 10 inputs per second
    -i - \                     # read the PNGs from stdin
    -r 25 \                    # output frame-rate; frames will be duplicated to pad out to 25FPS
    -c:v libx264 \             # x264 for compatibility
    -pix_fmt yuv420p \         # force a standard colorspace; otherwise the PNG colorspace is kept, breaking browsers (!)
    -crf 33 \                  # adequate high quality
    -vf "scale=iw/2:ih/2" \    # shrink the image by 2×; the full detail is not necessary & this saves space
    -preset veryslow -tune animation \ # aim for the smallest file possible with animation-tuned settings
    ./stylegan-facestraining.mp4
```

Interpolations

The original ProGAN repo provided a config for generating interpolation videos, but that was removed in StyleGAN.
Cyril Diagne (@kikko_fr) implemented a replacement, providing 3 kinds of videos:

- random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default it makes 4 of them arranged 2×2 in the video. Several interpolation videos are shown in the examples section.
- interpolate.mp4: a 'coarse' "style mixing" video; a single 'source' face is generated & held constant; a secondary interpolation video, a random walk as before, is generated; at each step of the random walk, the 'coarse'/high-level 'style' noise is copied from the random walk to overwrite the source face's original style noise. For faces, this means that the original face will be modified with all sorts of orientations & facial expressions while still remaining recognizably the original character. (It is the video analog of Karras et al 2018's Figure 3.)
- fine_503.mp4: a 'fine' style mixing video; in this case, the style noise is taken from later on, and instead of affecting the global orientation or expression, it affects subtler details like the precise shape of hair strands or hair color or mouths.

A copy of Diagne's video.py:

```python
import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config
import scipy
import moviepy.editor

def main():
    tflib.init_tf()

    # Load pre-trained network.
    # url = 'https://drive.google.com/uc?id=1MEGjdvVpUsu1jB4zrXZN7Y4kBBOzizDQ'
    # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
    ## NOTE: insert model here:
    _G, _D, Gs = pickle.load(open("results/02047-sgan-faces-2gpu/network-snapshot-013221.pkl", "rb"))
    # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
    # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
    # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.

    ## Video 1: random grid interpolation (random walk through the latent space).
    grid_size = [2, 2]
    image_shrink = 1
    image_zoom = 1
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'
    random_seed = 404
    mp4_file = 'results/random_grid_%s.mp4' % random_seed
    minibatch_size = 8

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_state = np.random.RandomState(random_seed)

    # Generate latent vectors
    shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:]  # [frame, image, channel, component]
    all_latents = random_state.randn(*shape).astype(np.float32)
    all_latents = scipy.ndimage.gaussian_filter(all_latents,
        [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
    all_latents /= np.sqrt(np.mean(np.square(all_latents)))

    def create_image_grid(images, grid_size=None):
        assert images.ndim == 3 or images.ndim == 4
        num, img_h, img_w, channels = images.shape
        if grid_size is not None:
            grid_w, grid_h = tuple(grid_size)
        else:
            grid_w = max(int(np.ceil(np.sqrt(num))), 1)
            grid_h = max((num - 1) // grid_w + 1, 1)
        grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
        for idx in range(num):
            x = (idx % grid_w) * img_w
            y = (idx // grid_w) * img_h
            grid[y : y + img_h, x : x + img_w] = images[idx]
        return grid

    # Frame generation func for moviepy.
    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        latents = all_latents[frame_idx]
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)
        grid = create_image_grid(images, grid_size)
        if image_zoom > 1:
            grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
        if grid.shape[2] == 1:
            grid = grid.repeat(3, 2)  # grayscale => RGB
        return grid

    # Generate video.
    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

    ## Video 2: 'coarse' style mixing.
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_seed = 500
    random_state = np.random.RandomState(random_seed)

    w = 512
    h = 512
    # src_seeds = [601]
    dst_seeds = [700]
    style_ranges = ([0] * 7 + [range(8, 16)]) * len(dst_seeds)

    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)

    shape = [num_frames] + Gs.input_shape[1:]  # [frame, image, channel, component]
    src_latents = random_state.randn(*shape).astype(np.float32)
    src_latents = scipy.ndimage.gaussian_filter(src_latents, smoothing_sec * mp4_fps, mode='wrap')
    src_latents /= np.sqrt(np.mean(np.square(src_latents)))
    dst_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in dst_seeds])

    src_dlatents = Gs.components.mapping.run(src_latents, None)  # [seed, layer, component]
    dst_dlatents = Gs.components.mapping.run(dst_latents, None)  # [seed, layer, component]
    src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
    dst_images = Gs.components.synthesis.run(dst_dlatents, randomize_noise=False, **synthesis_kwargs)

    canvas = PIL.Image.new('RGB', (w * (len(dst_seeds) + 1), h * 2), 'white')
    for col, dst_image in enumerate(list(dst_images)):
        canvas.paste(PIL.Image.fromarray(dst_image, 'RGB'), ((col + 1) * h, 0))

    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        src_image = src_images[frame_idx]
        canvas.paste(PIL.Image.fromarray(src_image, 'RGB'), (0, h))
        for col, dst_image in enumerate(list(dst_images)):
            col_dlatents = np.stack([dst_dlatents[col]])
            col_dlatents[:, style_ranges[col]] = src_dlatents[frame_idx, style_ranges[col]]
            col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
            for row, image in enumerate(list(col_images)):
                canvas.paste(PIL.Image.fromarray(image, 'RGB'), ((col + 1) * h, (row + 1) * w))
        return np.array(canvas)

    # Generate video.
    mp4_file = 'results/interpolate.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'
    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

    ## Video 3: 'fine' style mixing.
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_seed = 503
    random_state = np.random.RandomState(random_seed)

    w = 512
    h = 512
    style_ranges = [range(6, 16)]

    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)

    shape = [num_frames] + Gs.input_shape[1:]  # [frame, image, channel, component]
    src_latents = random_state.randn(*shape).astype(np.float32)
    src_latents = scipy.ndimage.gaussian_filter(src_latents, smoothing_sec * mp4_fps, mode='wrap')
    src_latents /= np.sqrt(np.mean(np.square(src_latents)))
    dst_latents = np.stack([random_state.randn(Gs.input_shape[1])])

    src_dlatents = Gs.components.mapping.run(src_latents, None)  # [seed, layer, component]
    dst_dlatents = Gs.components.mapping.run(dst_latents, None)  # [seed, layer, component]

    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        col_dlatents = np.stack([dst_dlatents[0]])
        col_dlatents[:, style_ranges[0]] = src_dlatents[frame_idx, style_ranges[0]]
        col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
        return col_images[0]

    # Generate video.
    mp4_file = 'results/fine_%s.mp4' % (random_seed)
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'
    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()
```

'Coarse' style-transfer/interpolation video
'Fine' style-transfer/interpolation video

Circular interpolations are another interesting kind of interpolation, written by snowy halcy: instead of randomly walking around the latent space freely, with large or awkward transitions, they move around a fixed high-dimensional point by doing "binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a sphere or circle in the latent space." A later version of circular interpolation is in snowy halcy's face editor repo, but here is the original version cleaned up into a stand-alone program:

```python
import math
import pickle
import moviepy.editor
import numpy as np
from numpy import linalg
import dnnlib.tflib as tflib

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    def circ_generator(latents_interpolate):
        radius = 40.0
        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)
        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius
        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0
        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = []

        video_length = 1.0
        while current_pos < video_length:
            array_list.append(current_image)

            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            # Binary search on the step size until the frame-to-frame MSE is in range.
            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0
                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0
                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print(current_pos, current_mse)
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20
    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()
```

'Circular' interpolation video

An interesting use of interpolations is Kyle McLean's "Waifu Synthesis" video: a singing anime video mashing up StyleGAN anime faces + GPT-2 lyrics + Project Magenta music.

StyleGAN 2

StyleGAN 2 (source, video) eliminates the blob artifacts, adds a native encoding 'projection' feature for editing, simplifies the runtime by scrapping progressive growing in favor of an MSG-GAN-like multi-scale architecture, and has higher overall quality—but similar total training time/requirements.

I used a 512px anime portrait S2 model trained by Aaron Gokaslan to create ThisWaifuDoesNotExist v3:

100 random sample images from the StyleGAN 2 anime portrait faces in TWDNE v3, arranged in a 10×10 grid.

Training samples:

Iteration #24,303 of Gokaslan's training of an anime portrait StyleGAN 2 model (training samples)

The model was trained to iteration #24,664 for >2 weeks on 4 Nvidia 2080 Ti GPUs at 35–70s per 1k images. The TensorFlow S2 model is available for download (320MB). (PyTorch & ONNX versions have been made by Anton using a custom repo.) This model can be used in Google Colab (demonstration notebook, although it seems it may pull in an older S2 model), & the model can also be used with the S2 codebase for encoding anime faces.

Running S2

Because of the optimizations, which require custom local compilation of CUDA code for maximum efficiency, getting S2 running can be more challenging than getting S1 running.

No TensorFlow 2 compatibility: the TF version must be 1.14/1.15. Trying to run with TF 2 will give errors like:

```
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'
```

I ran into cuDNN compatibility problems with TF 1.15 (which requires cuDNN ≥7.6.0, 2019-05-20, for CUDA 10.0), which gave errors like this:

```
...[2020-01-11 23:10:35.234784: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.4.2 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration...
```

But then with 1.14, the tpu-estimator library was not found! (I ultimately took the risk of upgrading my installation with libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb, and thankfully that worked and did not seem to break anything else.)

Getting the entire pipeline to compile the custom ops in a Conda environment was annoying, so Gokaslan tweaked it to use TF 1.14 on Linux, used cudatoolkit-dev from Conda Forge, and changed the build script to use gcc-7 (since gcc-8 was unsupported).

one issue with TensorFlow 1.14 is that you need to force allow_growth or it will error out on Nvidia 2080 Tis (see the snippet at the end of this list)

config name change: train.py has been renamed (again) to run_training.py

buggy learning rates: S2 (but not S1) accidentally uses the same LR for both G & D; either fix this or keep it in mind when doing LR tuning—changes to D_lrate do nothing!

n=1 minibatch problems: S2 is not a large NN, so it can be trained on low-end GPUs; however, the S2 code makes an unnecessary assumption that n≥2. To fix this in training/loss.py (already fixed in Shawn Presser's TPU/self-attention-oriented fork):

```diff
@@ -157,9 +157,8 @@ def G_logistic_ns_pathreg(G, D, opt, training_set, minibatch_size, pl_minibatch_
     with tf.name_scope('PathReg'):
         # Evaluate the regularization term using a smaller minibatch to conserve memory.
-        if pl_minibatch_shrink > 1 and minibatch_size > 1:
-            assert minibatch_size % pl_minibatch_shrink == 0
-            pl_minibatch = minibatch_size // pl_minibatch_shrink
+        if pl_minibatch_shrink > 1:
+            pl_minibatch = tf.maximum(1, minibatch_size // pl_minibatch_shrink)
             pl_latents = tf.random_normal([pl_minibatch] + G.input_shapes[0][1:])
             pl_labels = training_set.get_random_labels_tf(pl_minibatch)
             fake_images_out, fake_dlatents_out = G.get_output_for(pl_latents, pl_labels, is_training=True, return_dlatents=True)
```

S2 has some sort of memory leak, possibly related to the FID evaluations, requiring regular restarts (eg by putting it into a shell loop).

Once S2 was running, Gokaslan trained the S2 portrait model with generally default hyperparameters.
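For the allow_growth issue mentioned above, a minimal sketch of the setting in question in plain TensorFlow 1.x; exactly where it gets wired into the S2 launch scripts is left as an assumption:

```python
import tensorflow as tf  # TF 1.14

# Allocate GPU memory incrementally as needed, rather than grabbing it all up
# front (the up-front grab is what errors out on the 2080 Ti).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

# In the StyleGAN/StyleGAN 2 codebase, the session is created by dnnlib.tflib.init_tf(),
# which takes a config dict of dotted option names; assuming that interface, something
# like tflib.init_tf({'gpu_options.allow_growth': True}) should be the equivalent.
```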

Future Work

Some open questions about StyleGAN's architecture & training dynamics:

- is progressive growing still necessary with StyleGAN? (StyleGAN 2 implies that it is not, as it uses an MSG-GAN-like approach)

- are the 8×512 FC layers necessary? (Preliminary BigGAN work suggests that they are not necessary for BigGAN.)

- what are the wrinkly-line/crack noise artifacts which appear at the end of training?

- how does StyleGAN compare to BigGAN in final quality?

Further possible work:

- exploration of "curriculum learning": can training be sped up by training to convergence on a small n and then periodically expanding the dataset?

- bootstrapping image generation by starting with a seed corpus, generating many random samples, selecting the best by hand, and retraining; eg expanding a corpus of a specific character, or exploring 'hybrid' corpuses which mix A/B images and then selecting for images which look most A+B-ish

- improved transfer-learning scripts to edit trained models, so 512px pretrained models can be promoted to work with 1024px images and vice versa

- a better Danbooru tagger CNN for providing classification embeddings for various purposes, particularly FID loss monitoring, minibatch discrimination/auxiliary losses, and style transfer for creating a 'StyleDanbooru'; with a StyleDanbooru, I am curious whether it can be used as a particularly powerful form of data augmentation for small-n character datasets, and whether it leads to a reversal of training dynamics, with edges coming before colors/textures—it's possible that a StyleDanbooru could make many GAN architectures, not just StyleGAN, stable to train on anime/illustration datasets

- borrowing architectural enhancements from BigGAN: self-attention layers, spectral-norm regularization, large-minibatch training, and a rectified Gaussian distribution for the latent vector z

- a text→image conditional GAN architecture (à la StackGAN): this would take the text tag descriptions of each image compiled by Danbooru users and use them as inputs to StyleGAN; should it work, you could create arbitrary anime images simply by typing in a string like 1_boy samurai facing_viewer red_hair clouds sword armor blood, etc. By providing rich semantic descriptions of each image, this should also make training faster & stabler and converge to higher final quality.

- meta-learning for few-shot face or character or artist imitation (eg Set-CGAN or FIGR or perhaps FUNIT, or Noguchi & Harada 2019—the last of which achieves few-shot learning with samples of n=25 TWDNE StyleGAN anime faces)

ImageNet StyleGAN

As part of experiments in scaling up StyleGAN 2, using TFRC research credits, we ran StyleGAN on large-scale datasets including Danbooru2019, ImageNet, and subsets of the Flickr YFCC100M dataset. Despite running for millions of images, no S2 run ever achieved remotely the realism of S2 on FFHQ or BigGAN on ImageNet: while the textures could be surprisingly good, the semantic global structure never came together, with glaring flaws—there would be too many heads, or they would be detached from bodies, etc. Aaron Gokaslan took the time to compute the FID on ImageNet, estimating a terrible score of FID ~120. (Higher = worse; for comparison, BigGAN with EvoNorm can be as good as FID ~7, and regular BigGAN typically surpasses FID 120 within a few thousand iterations.) Even experiments in increasing the S2 model size up to ~1GB (by increasing the feature-map multiplier) improved quality relatively modestly, and showed no signs of ever approaching BigGAN-level quality. We concluded that StyleGAN is in fact fundamentally limited as a GAN, trading off stability for power, and switched over to BigGAN work.

For those interested, we provide our 512px ImageNet S2 model (step 1,394,688):

```bash
rsync --verbose rsync://78.46.86.149:873/biggan/2020-04-07-shawwn-stylegan-imagenet-512px-run52-1394688.pkl.xz ./
```

Shawn Presser, S2 ImageNet interpolation video from partway through training (~45 hours on a TPU v3-512, 3k images/s)

Danbooru2019+e621 256px BigGAN

As part of testing our modifications to compare_gan—including sampling from multiple datasets to increase n, using flood loss to stabilize training, and adding an additional (crude, limited) kind of self-supervised SimCLR loss to the D—we trained several 256px BigGANs, initially on Danbooru2019 SFW but then adding in the TWDNE portraits & e621/e621-portraits partway through training. This destabilized the models greatly, but the flood loss appears to have stopped divergence and they gradually recovered. Run #39 did somewhat better than run #40; the self-supervised variants never recovered. This indicated to us that our self-supervised loss needed heavy revision (as indeed it did), and that flood loss was more valuable than expected, so we investigated it further; the important part appears—for GANs, anyway—to be the stop-loss part, halting training of G/D when it gets 'too good'. Freezing models is an old GAN trick which is mostly unused post-WGAN, but appears useful for BigGAN, perhaps because of the spiky loss curve, especially early in training. We ran it for 607,250 iterations on a TPUv3-256 pod until 2020-05-15.
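For reference, a minimal sketch of the flood-loss idea ("flooding", Ishida et al 2020) as it would apply to the G/D losses; the flood levels match the d_flood/g_flood options in the config below, but this is an illustration, not the actual compare_gan patch:

```python
import tensorflow as tf

def flood(loss, flood_level):
    """Flooding: |loss - b| + b. Once the raw loss drops below the flood level b,
    the gradient reverses sign, so the network 'treads water' instead of driving
    its loss toward zero; for a GAN, a soft stop-loss on whichever of G/D is
    getting 'too good'."""
    return tf.abs(loss - flood_level) + flood_level

# Usage inside the loss functions (levels mirror options.d_flood / options.g_flood below):
#   d_loss = flood(d_loss, 0.2)
#   g_loss = flood(g_loss, 0.05)
```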
Config: {"dataset.name": "images_256", "resnet_biggan.Discriminator.blocks_with_attention": "B2", "resnet_biggan.Discriminator.ch": 96, "resnet_biggan.Generator.blocks_with_attention": "B5", "resnet_biggan.Generator.ch": 96, "resnet_biggan.Generator.plain_tanh": false, "ModularGAN.d_lr": 0.0005, "ModularGAN.d_lr_mul": 3.0, "ModularGAN.ema_start_step": 4000, "ModularGAN.g_lr": 6.66e-05, "ModularGAN.g_lr_mul": 1.0, "options.batch_size": 2048, "options.d_flood": 0.2, "options.datasets": "gs://XYZ-euw4a/datasets/danbooru2019-s/danbooru2019-s-0*,gs://XYZ-euw4a/datasets/e621-s/e621-s-0*, gs://XYZ-euw4a/datasets/portraits/portraits-0*,gs://XYZ-euw4a/datasets/e621-portraits-s-512/e621-portraits-s-512-0*", "options.g_flood": 0.05, "options.labels": "", "options.random_labels": true, "options.z_dim": 140, "run_config.experimental_host_call_every_n_steps": 50, "run_config.keep_checkpoint_every_n_hours": 0.5, "standardize_batch.use_cross_replica_mean": true, "TpuSummaries.save_image_steps": 50, "TpuSummaries.save_summary_steps": 1} 90 random EMA samples (untruncated) from the 256px Big GAN trained on Danbooru2019/anime-portraits/e621/e621-portraits. The model is available for download: rsync --verbose rsync://78.46.86.149:873/biggan/2020-05-18-spresser-biggan-256px-danbooruplus-run39-607250.tar.xz ./