Being a fashion model isn’t as easy as it looks. Good looks go a long way, but presenting an outfit in the best possible light also requires an exhaustive awareness of poses and the patience to perform for hours under hot lights in the studio or on the catwalk. AI has taken on a wide range of challenges in the last several years, and now machine learning researchers have set their sights on fashion models.

A new research paper from Berlin-based fashion and technology unicorn Zalando uses generative adversarial networks (GANs) to produce high-resolution images of virtual fashion models ready to model clothes of any style.

The researchers set out to create an AI system capable of transferring customizable outfits and body poses from one fashion model to another. They used an architecture based mostly on StyleGAN, a technique introduced by NVIDIA in 2018 that enables intuitive, scale-specific control over image generation.
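The idea behind StyleGAN's scale-specific control can be sketched in a few lines. In StyleGAN, each generator layer receives its own copy of a style vector, and coarse (low-resolution) layers govern global attributes such as pose and body shape while fine (high-resolution) layers govern colour and texture. The layer count of 18 matches the paper's generator; the `mix_styles` helper and the crossover point are illustrative assumptions, not the authors' code.

```python
import numpy as np

N_LAYERS, D = 18, 512  # 18 generator layers, as in the paper; D is assumed

def mix_styles(w_coarse, w_fine, crossover):
    """Build a per-layer style list: layers [0, crossover) take one style
    (controlling coarse structure like pose), the rest take another
    (controlling fine detail like colour)."""
    return [w_coarse if i < crossover else w_fine for i in range(N_LAYERS)]

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=D), rng.normal(size=D)
# Keep model #1's pose/body from w1, but take colours/textures from w2.
per_layer = mix_styles(w1, w2, crossover=8)
```

Feeding a different style vector to different layer ranges is what makes transfers like "this model's pose, that outfit's colours" possible.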

The researchers built a proprietary image dataset of about 380K images at 1024×768 pixel resolution. Each image shows a fashion model holding a pose and wearing an outfit comprising up to six pieces of clothing and accessories. A deep pose estimator extracts 16 key points from each image.
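A single training sample can be thought of as an image paired with its outfit articles and pose key points. The record below is a hypothetical sketch of that structure (the field names and validation are assumptions, not the paper's schema); it only encodes the two constraints stated above: at most six articles and exactly 16 key points.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FashionSample:
    """Hypothetical dataset record: one posed model image, its outfit,
    and the 16 pose key points extracted by a pose estimator."""
    image_path: str
    articles: List[str]                    # up to 6 clothing/accessory items
    keypoints: List[Tuple[float, float]]   # 16 (x, y) pose key points

    def __post_init__(self):
        if len(self.articles) > 6:
            raise ValueError("an outfit comprises at most six articles")
        if len(self.keypoints) != 16:
            raise ValueError("expected exactly 16 pose key points")

sample = FashionSample(
    image_path="model_001.jpg",
    articles=["jacket", "t-shirt", "jeans", "sneakers"],
    keypoints=[(0.5, 0.1)] * 16,  # dummy normalized coordinates
)
```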

Samples from Zalando’s dataset (red markers represent the extracted key points)

Researchers used both unconditional and conditional StyleGANs. The unconditional StyleGAN contains 18 generator layers, each of which receives an affinely transformed copy of the style vector for adaptive instance normalization. The conditional StyleGAN meanwhile is modified with an embedding network that encodes the input outfit; if an outfit does not include an article in a particular semantic category, an empty grey field stands in for it.

Flowcharts of the (a) unconditional and (b) conditional GANs.

The conditional and unconditional StyleGANs were trained for four weeks on four NVIDIA V100 GPUs. The unconditional GAN model generated realistic images of model poses and clothing articles as shown below.

Transferring outfit colours and body poses to different generated models.

The conditional GAN meanwhile captured and reproduced fashion models with a variety of body types and outfits, as shown below.

Different outfits used to generate various model images, e.g. the jacket from outfit #1 is added to outfit #2 to customize the visualization.

The unconditional GAN achieved a better FID (Fréchet Inception Distance, a measure of similarity between two sets of images, where a lower score is better). The researchers suggest the conditional GAN produced lower image quality because it bore the additional task of satisfying the conditional discriminator’s check on the input outfit, resulting in a trade-off between image quality and controllability over the generated outfit and pose.
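FID compares the statistics of feature embeddings (typically Inception activations) of real and generated images: the squared distance between their means plus a covariance term. The sketch below is a minimal NumPy implementation of the standard formula, assuming the features have already been extracted; it is not the evaluation code used in the paper.

```python
import numpy as np

def fid(feats_real, feats_fake):
    """Fréchet Inception Distance between two sets of feature vectors,
    each of shape (n_samples, n_features).
    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^(1/2))."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    # Tr((S1 S2)^(1/2)) via eigenvalues: for PSD covariances the product's
    # eigenvalues are real and non-negative (up to numerical noise).
    eigs = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eigs.real, 0, None)).sum()
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1) + np.trace(s2) - 2 * tr_sqrt

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))   # stand-in "real" features
b = a + 5.0                      # shifted "generated" features
score_same, score_shifted = fid(a, a), fid(a, b)
```

Identical feature sets give an FID near zero, while any shift in the feature distribution drives the score up, which is why a lower FID indicates generated images statistically closer to real ones.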

Models’ FID Scores

Today’s fashion brands and e-commerce platforms are keen to personalize the apparel shopping experience, and this research introduces an innovative way to simplify the visualization of fashion products on shoppers. It’s also possible that this approach could be combined with deepfake techniques to synthesize a customer’s actual face onto such visualizations, further personalizing the virtual fashion shopping experience.

The paper Generating High-Resolution Fashion Model Images Wearing Custom Outfits is on arXiv.