The Internet is woven into our everyday lives. We access massive amounts of data through our laptops, smartphones and tablets. This free flow of information, however, has prompted attempts to filter content that may be inappropriate for some audiences, for example young people. One such new effort from Brazil puts virtual bikinis on nudes.

Many AI researchers are working on representation learning approaches to filter specific content for sensitive users. Adult content recognition with deep neural networks (ACORDE), for example, can distinguish sensitive from non-sensitive content, producing a binary classification result that blocks any content containing nudity outright. Such a solution, however, can compromise the user experience with digital media.

An alternative approach is censoring only the sensitive regions of an image. In late June, a group of researchers in Brazil published the paper Seamless Nudity Censorship: An Image-to-Image Translation Approach based on Adversarial Training at IJCNN 2018. It proposes the first automatic method to implicitly locate and mask sensitive regions in images while preserving image semantics.

This scheme is regarded as a more practical approach, as it delivers overall content information in a non-intrusive manner by masking only sensitive content. The technique can also save time spent on manual annotation of body parts.

Techniques for censoring sensitive regions of an image. (a) to (c): manual strategies commonly used for localized censorship. (d): a fully automatic seamless censoring approach using the novel unpaired image-to-image translation technique described in the paper.

The authors created a conditional cycle-consistent adversarial network (CycleGAN) consisting of two types of neural network models: a mapping generator G, which produces realistic-looking images to fool the discriminator; and a discriminator D, which discerns real images from the training dataset from synthetic images produced by the generator. The network is trained on unpaired samples, with conditional inputs guiding the censoring task.
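A CycleGAN pairs two mapping directions so the model can be trained without aligned image pairs. The sketch below uses deliberately tiny stand-in networks (the layer sizes and shapes are illustrative assumptions, not the paper's architecture, which uses a 9-Blocks ResNet or U-Net 256 generator) just to show how the pieces connect:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the paper's networks (shapes are assumptions;
# the real generators are a 9-Blocks ResNet or a U-Net 256).
def tiny_generator():
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 3, 3, padding=1), nn.Tanh())

def tiny_discriminator():
    return nn.Sequential(nn.Conv2d(3, 8, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2),
                         nn.Conv2d(8, 1, 4, stride=2, padding=1))

G = tiny_generator()        # maps nude -> bikini
F = tiny_generator()        # maps bikini -> nude (needed for the cycle)
D_Y = tiny_discriminator()  # judges real vs. generated bikini images

x = torch.randn(1, 3, 64, 64)  # one unpaired source image
fake_y = G(x)                  # translated image for D_Y to judge
cycle_x = F(fake_y)            # the round trip should reconstruct x
```

The round trip `F(G(x))` is what the cycle-consistency loss (described below in the article) compares against the original input.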

Researchers combined the least-squares adversarial loss function, chosen for its training stability, with the cycle-consistency loss function to optimize the model. The authors also tested two popular architectures for the generators, namely 9-Blocks ResNet and U-Net 256, along with a simple standard discriminator architecture with an increasing number of filters in its final convolutional layers.
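The two loss terms are straightforward to compute. The sketch below shows the least-squares adversarial loss and an L1 cycle-consistency loss on toy arrays; the weighting factor `lam=10.0` is a common CycleGAN default, assumed here rather than taken from the paper:

```python
import numpy as np

def lsgan_loss(pred, target_is_real):
    """Least-squares adversarial loss: mean squared distance of the
    discriminator's output from the real (1) or fake (0) label."""
    target = 1.0 if target_is_real else 0.0
    return np.mean((pred - target) ** 2)

def cycle_loss(real, reconstructed, lam=10.0):
    """Cycle-consistency loss: weighted L1 distance between an image
    and its reconstruction after the round trip F(G(x))."""
    return lam * np.mean(np.abs(real - reconstructed))

# Toy example: discriminator scores on generated images, plus an
# image and its round-trip reconstruction.
d_fake = np.array([0.3, 0.4])
g_adv = lsgan_loss(d_fake, True)     # generator wants these judged real
real = np.zeros((4, 4))
rec = np.full((4, 4), 0.1)
total_g = g_adv + cycle_loss(real, rec)
```

In training, the generator minimizes this combined objective while the discriminator minimizes its own least-squares loss on real versus generated images.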

Existing sensitive content identification training datasets usually contain images that are entirely irrelevant to adult content, and are thus unsuitable for this image-to-image translation model. The authors compiled their own dataset from the Internet, comprising nude women (921 images for training and 103 for testing) and women wearing bikinis (1,044 training images and 117 test images). The models were trained at 256 × 256 pixel resolution.
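Preparing images at a fixed training resolution is a routine step. A plausible preprocessing helper (the paper does not detail its exact pipeline, so the function name and resampling choice here are assumptions) might look like:

```python
from PIL import Image

TRAIN_SIZE = (256, 256)  # resolution reported in the paper

def load_for_training(source):
    """Load an image from a path or file-like object, convert to RGB,
    and resize to the training resolution (an assumed preprocessing
    step; the paper's actual pipeline is not specified)."""
    img = Image.open(source).convert("RGB")
    return img.resize(TRAIN_SIZE, Image.BICUBIC)
```

In practice, CycleGAN implementations often resize slightly larger and random-crop to 256 × 256 for augmentation, but the simple resize above captures the stated resolution.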

During model training the authors noticed that background noise could significantly degrade both the speed of learning and the quality of the produced images. To solve this problem, they used Mask R-CNN, a state-of-the-art approach for semantic and instance segmentation, to identify and cut out subjects and place them against white backgrounds.
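Given a segmentation mask for the subject (in the paper this comes from Mask R-CNN; here the mask is simply assumed as input), placing the subject on a white background is a single compositing step:

```python
import numpy as np

def composite_on_white(image, mask):
    """Place a segmented subject on a white background.

    `image` is an HxWx3 uint8 array and `mask` an HxW boolean array
    marking subject pixels -- in the paper the mask comes from
    Mask R-CNN; here it is assumed given.
    """
    out = np.full_like(image, 255)  # start from a white canvas
    out[mask] = image[mask]         # copy over only the subject pixels
    return out

img = np.zeros((4, 4, 3), dtype=np.uint8)  # all-black toy image
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                      # pretend the subject is the center
result = composite_on_white(img, mask)
```

Removing background variation this way means the generator only ever has to model the subject, which is consistent with the training speed-up the authors report.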

After model optimization, the image-to-image translation framework was able to distinguish between nude and clothed women, and the generators successfully learned to select and process the relevant content.

Results after training on the original dataset. Row (A): real images (manually censored for publication); Row (B): results using the 9-Blocks ResNet generator; Row (C): results using the U-Net 256 generator. (Blurring applied to unsatisfactory results.)

While both architectures produce visually impressive results, the 9-Blocks ResNet generator appears better suited to seamless nudity censorship. It is an autoencoder that applies residual connections between its bottleneck layers, using ReLU activations and instance normalization after the convolutions.
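A residual block of the kind described can be sketched briefly. The channel count and padding style below are illustrative assumptions (the original CycleGAN design uses reflection padding and 256 channels in the bottleneck); the essential elements are the two convolutions, instance normalization, ReLU, and the skip connection:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block of the generator's bottleneck: two
    convolutions with instance normalization, ReLU after the first,
    and a skip connection adding the input back. Hyperparameters
    here are assumptions, not the paper's exact values."""
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection preserves shape

block = ResidualBlock(channels=16)
x = torch.randn(1, 16, 64, 64)
y = block(x)  # same shape as the input: (1, 16, 64, 64)
```

Because each block preserves spatial resolution, nine of them can be stacked between the encoder and decoder, which is where the "9-Blocks" name comes from.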

The ResNet-based model (Row B) consistently outperformed U-Net (Row C), which had difficulty positioning realistic bikinis onto sensitive regions and also distorted the original images.

Moving forward, the authors intend to analyze the impact of various network architectures and loss functions on the generated images. They would also like to embed the method into a browser application.

Source: Synced China