This is the first part of our special feature series on deepfakes, exploring the latest developments and implications in this nascent field of AI. We will cover detailed implementations of generation and countering strategies in future articles, so please stay tuned to GradientCrescent to learn more.

Introduction

1-to-1 facial translation via deepfakes, as exemplified by Obama (L) and Peele (R)

“Deepfakes”, or fake composite videos generated through deep learning, have entered public discourse due to the strong possibility of targeted, unethical applications. Last week saw the rise and fall of the “DeepNude” application, which allowed users to visualize how women would look undressed; it gained notoriety for its unethical applications and was appropriately removed as a result. The realism of synthetic mimicries of Barack Obama, Joe Rogan, and Mark Zuckerberg has forced policymakers to begin analyzing the legal and ethical implications of deepfake technology.

Don’t forget the employment implications too. Looking at you, Samberg.

As a society, we are entering a time in which objective truth is becoming harder to define, as faked evidence grows ever more convincing and authentic-looking. Its presence could dramatically affect what is considered acceptable validation, with implications for criminal and civil law, democratic elections, and the context of public discourse.

While the manipulation of audiovisual evidence to propagate one's agenda has been around since the earliest days of media (with global ramifications; the Gleiwitz incident staged by Nazi Germany is a prime example), the adoption of social media as a communication platform has made individuals particularly vulnerable to such technologies. Vanilla social networks have already been suggested to increase social isolation, shorten attention spans, radicalize opinions, and promote silo-forming behavior, and such users would now be more easily influenced by messages tailored to them by familiar authorities through deepfakes.

Moreover, a new avenue of plausible deniability would open to public figures, with potentially historic effects: notable incidents such as Mitt Romney's 47% speech and the Watergate scandal could be swiftly denied in an age where no evidence can be considered 100% authentic. Suffice it to say, the importance of developing strategies to counter deepfakes cannot be overstated. In this article, we'll cover the elements of deepfake generation and iterate through the latest approaches in the literature for detecting them.

Deepfake Generation: A Brief Explanation

To better understand the type of problem and how best to address it, let's go over how a simple facial-transfer deepfake is created, in layman's terms. For our explanation, we'll only consider the visual component of a deepfake. Note that this is a general overview; a more code-based explanation will be covered in a separate article.

At their core, deepfakes are simply a series of image frames generated using autoencoders, networks designed to downsample and restore images, a topic we've covered previously. Let's illustrate the principle of action with an example of two subjects, A and B.

Assume that these two subjects are public figures, and that thousands of reference frames are available for each. An autoencoder is assigned to each subject: its encoder compresses images into lower-dimensional feature maps, which its decoder then reconstructs into the original image. Note that the encoder captures a wide array of facial details from an image, such as angle, skin tone, facial expression, and lighting.

Once the model is trained on enough frames of a person covering a wide array of these parameters, we are able to reconstruct the subject performing a similarly wide array of facial configurations. Naturally, training is a time- and resource-intensive process.

Deepfake facial translation, as illustrated by Lui et al.

Once the autoencoders have been trained, we can start creating our deepfake outputs. To map the face of subject A onto subject B, an input sequence of frames of A's face is fed into A's encoder, generating a feature map, which you can think of as the blueprint of A's face. During reconstruction, we then use subject B's decoder to rebuild the image from A's feature map, generating a composite of subject B's body and background with the facial features of subject A. As the decoder has been trained on data from subject B, we are reconstructing person B, but with the context of A, essentially faking the appearance of A. Naturally, one could reverse the components in this process to generate the opposite deepfake as well.
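The encode-with-A, decode-with-B flow above can be sketched with a toy example. This is a minimal illustration of the data flow only: the "autoencoders" here are untrained random linear maps over flattened vectors (standing in for real convolutional networks), and all dimensions are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, CODE_DIM = 64, 8  # flattened "frame" size and bottleneck size (toy values)

def make_autoencoder():
    """Toy linear autoencoder: encode(x) = W_e @ x, decode(z) = W_d @ z."""
    W_e = rng.normal(size=(CODE_DIM, IMG_DIM)) / np.sqrt(IMG_DIM)
    W_d = rng.normal(size=(IMG_DIM, CODE_DIM)) / np.sqrt(CODE_DIM)
    return W_e, W_d

# One autoencoder per subject, as described above.
enc_A, dec_A = make_autoencoder()
enc_B, dec_B = make_autoencoder()

frame_A = rng.normal(size=IMG_DIM)    # a flattened frame of subject A

feature_map = enc_A @ frame_A         # A's "blueprint"
recon_A = dec_A @ feature_map         # normal reconstruction of A
fake = dec_B @ feature_map            # deepfake step: A's blueprint, B's decoder
```

The key point is the last line: the bottleneck representation of A is rendered by a decoder that only knows how to produce B-like images.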

Note that we've described only a single training iteration of our model. The generated composites need to be evaluated and tweaked to achieve improved realism. While the results of our model could be evaluated manually, a GAN allows for self-supervised training, altering parameters to generate more realistic images through a discriminator component. To learn more about GANs, we encourage the reader to consult our original work on autoencoders; essentially, you can think of the duo as a counterfeiter and a corrupt cop comparing notes, improving the counterfeiter's skill until an example passable as genuine currency is produced.
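The counterfeiter/cop dynamic can be made concrete with a deliberately tiny GAN: one-dimensional "images", a generator that only learns an offset, and a logistic-regression discriminator. This is a sketch of the adversarial training loop, not of any real deepfake pipeline; all values and learning rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

REAL_MEAN, LR, STEPS, BATCH = 3.0, 0.05, 3000, 64

# Generator ("counterfeiter"): G(z) = z + a, learns only the offset a.
# Discriminator ("corrupt cop"): D(x) = sigmoid(w*x + c).
a, w, c = 0.0, 0.1, 0.0

for _ in range(STEPS):
    x_real = rng.normal(REAL_MEAN, 1.0, BATCH)
    z = rng.normal(0.0, 0.5, BATCH)
    x_fake = z + a

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w -= LR * np.mean(-(1 - d_real) * x_real + d_fake * x_fake)
    c -= LR * np.mean(-(1 - d_real) + d_fake)

    # Generator step: push D(fake) toward 1 (non-saturating loss).
    d_fake = sigmoid(w * (z + a) + c)
    a -= LR * np.mean(-(1 - d_fake) * w)
```

After alternating updates, the generator's offset `a` drifts toward the real distribution's mean, i.e. the counterfeits become statistically harder to tell apart from the real samples.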

Beyond simple facial-transfer deepfakes, intra-subject visual mimicry using autoencoder/GAN architectures was demonstrated to striking effect at SIGGRAPH 2018 by researchers from Stanford University.

Here, an autoencoder-based transformation network was paired with a GAN-based discriminator to create an architecture trained on a dataset of 2000 frames of public subjects such as Vladimir Putin or Barack Obama. While previous attempts in the field had been made, they suffered from severe noise- and warping-based artifacts. By presenting the input tensor as a spacetime sequence carrying temporal information, the researchers achieved a 1-to-1 translation of their own facial expressions and pose to those of the target subject, free of such artifacts.

Countering Deepfakes: the State of the Art

So given the potential implications of this problem, what options are available? Let's go over the latest tools.

Visual Inspection

Within the AI arms race, visual inspection is generally the first and most accessible line of defense. We've listed some of the details to look out for below; aspects specific to deepfakes rather than static images are highlighted. Note that much of this advice has already been rendered unreliable by advancements in deepfake generation, and is included for posterity.

Flickering artifacts

One of the issues with deepfake-based facial transfer is the transition between the face and other parts of the head, such as the neck or hair, which can manifest as flickering artifacts where the different components meet. Such artifacts can be masked with noise-based blurring, and are thus an unreliable signal.

Matching body aspects

Until now, most deepfakes have consisted of facial substitutions, so discrepancies with the rest of the body in terms of size, color, or proportions could raise suspicions about their authenticity. However, such advice applies only to inter-subject deepfakes, as intra-subject composites would not exhibit such discrepancies.

Clip length

Deepfake generation is extremely time- and resource-intensive, meaning that the publicized examples tend to be short in length. If the content seems implausible and is presented without context, there is a good chance that it's fake. However, such limitations would not significantly constrain state-level actors.

Lack of blinking

One of the earliest and most notable discrepancies in deepfakes was the lack of blinking in generated examples, attributed to an overrepresentation of non-blinking faces in the training data, an issue that was rapidly fixed, making this tip obsolete.

Oral details

Autoencoder/GAN combinations still struggle to model teeth and the interior of the mouth correctly, often generating misshapen or blurry examples. Research into overcoming this is already underway, and in the meantime deepfake creators may choose to use lower-resolution outputs, reminiscent of smartphone capture data, to mask this inadequacy.

Facial asymmetry

For static composites made via GANs, asymmetry in facial detail can be an indication of a faked image. You may observe glasses frames on only one side of the face, different earrings on the two ears, or textural differences between the two sides of the face.

Hair details

Hair has traditionally been known to be difficult for models to render properly. Disconnected strands or strands that are too stiff or streaked can be problematic, particularly when considering frame-to-frame differences within video.

Statistical analysis

You may have noticed that most of the tips above rely on the deepfaker's limited resources and skill rather than any inherent weakness of the deepfake itself, making visual inspection extremely unreliable. A more statistical approach, based on photo response non-uniformity (PRNU), was successfully demonstrated by Koopman et al. The PRNU pattern of a digital image is a unique noise pattern created by small factory defects in the light-sensitive sensor of a digital camera, and can be considered a digital fingerprint. Any attempt to manipulate an image, whether through neural networks or manual editing, alters the local PRNU pattern across the video frames. However, as the study utilized a very small dataset (n < 10), it remains to be seen whether such an approach is scalable.
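The PRNU idea can be simulated end to end with synthetic data. This is a rough sketch, not Koopman et al.'s actual method: the "sensor pattern" is just a fixed random array, the denoising filter is a crude box blur, and all constants are invented. It shows why a frame from the same sensor correlates with an averaged noise fingerprint while manipulated content does not.

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 32

def residual(frame):
    """Crude noise residual: frame minus a 4-neighbour box blur
    (a stand-in for the denoising filters used in real PRNU extraction)."""
    blur = (np.roll(frame, 1, 0) + np.roll(frame, -1, 0) +
            np.roll(frame, 1, 1) + np.roll(frame, -1, 1) + frame) / 5.0
    return frame - blur

def corr(x, y):
    return np.corrcoef(x.ravel(), y.ravel())[0, 1]

# A fixed noise pattern plays the role of the camera's PRNU fingerprint.
prnu = rng.normal(0, 1, (H, W))

def shoot(pattern):
    """Simulate a captured frame: smooth random scene + sensor pattern + shot noise."""
    scene = np.cumsum(np.cumsum(rng.normal(size=(H, W)), 0), 1) * 0.01
    return scene + 0.5 * pattern + 0.1 * rng.normal(size=(H, W))

# Estimate the camera fingerprint from many genuine frames.
fingerprint = np.mean([residual(shoot(prnu)) for _ in range(20)], axis=0)

genuine = shoot(prnu)                        # same sensor
tampered = shoot(rng.normal(0, 1, (H, W)))   # pattern disrupted, as after editing

c_gen = corr(residual(genuine), fingerprint)
c_tam = corr(residual(tampered), fingerprint)
```

A real forensic pipeline uses far more sophisticated denoising and peak-to-correlation statistics, but the detection logic is the same: manipulation destroys the sensor's local fingerprint.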

Deep Learning

Background comparison

Established, scalable AI-based counter-deepfake efforts have been demonstrated by the online GIF repository Gfycat. The website uses facial recognition models to spot inconsistencies in the rendering of the facial area of an uploaded video. Suspected fakes are further analyzed by masking the facial area and searching the database for a similar video with the same background and body. Similar references are then inspected for facial similarities in order to conclude on the authenticity of the video. However, this approach has several weaknesses: backgrounds could be fully faked, composites made from completely new footage would not be detected, and large databases of video footage are required.

Temporal pattern analysis

As a user's behavior can be described as a time series of movements, sequence-based data has been explored as an authentication approach. Researchers from Purdue University combined a CNN with an LSTM in order to process temporal data sequences. By passing each frame of a video through a CNN and feeding the resulting sequence of feature maps to the LSTM, the network was able to learn the specific movement-based behaviors of its subjects, achieving a test accuracy of over 97% on only 40 frames of data.
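The frame-by-frame pipeline can be sketched with untrained stand-ins: a fixed random projection takes the place of the CNN feature extractor, and a plain tanh recurrence takes the place of the LSTM. The point is the shape of the computation (per-frame features accumulated into a temporal state, then a single fake/real score), not its accuracy; all dimensions here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAMES, H, W, FEAT, HID = 40, 16, 16, 32, 24  # toy sizes; 40 frames as in the study

# Stand-in for the CNN: a fixed random projection of each flattened frame.
W_cnn = rng.normal(size=(FEAT, H * W)) / np.sqrt(H * W)

# Stand-in for the LSTM: a simple recurrent cell h_t = tanh(W h + U x).
W_rec = rng.normal(size=(HID, HID)) * 0.1
U_in = rng.normal(size=(HID, FEAT)) * 0.1
w_out = rng.normal(size=HID)                  # logistic readout: P(fake)

video = rng.normal(size=(FRAMES, H, W))       # placeholder for real frames

h = np.zeros(HID)
for frame in video:
    x = W_cnn @ frame.ravel()                 # per-frame feature vector
    h = np.tanh(W_rec @ h + U_in @ x)         # state accumulates movement info

p_fake = 1.0 / (1.0 + np.exp(-w_out @ h))     # single score for the whole clip
```

In the trained version, the recurrent state is what captures the subject-specific temporal behaviors (blinking cadence, head motion) that single-frame detectors miss.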

Facial artifacts

Recently, researchers from UC Berkeley teamed up with Adobe to create a tool capable of detecting manually manipulated (or photoshopped) images by detecting low-level facial warping. By training a CNN on examples of images manipulated using the popular Face Aware Liquify feature, a validation accuracy of 99% was achieved, versus 53% for a human observer. However, while such results are encouraging, the absence of GAN-generated training examples means that the network can only detect manually manipulated images, and it is hence poorly suited to combating deepfakes.

Facial warping was identified as a systematic pattern in deepfakes by researchers at the University at Albany. Their approach relied on the tendency of deepfake algorithms to create lower-resolution outputs of fixed size for computational efficiency. These outputs then undergo upscaling and affine transformations such as scaling, rotation, or shearing to match the pose of the target face they replace in the synthetic composite. The resulting differences in resolution and translation were found to generate artifacts that could be detected via CNNs.
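Why the resampling step leaves a detectable trace can be shown with a few lines of NumPy. This toy example (random pixels, block-mean downscaling, nearest-neighbour upscaling, a crude neighbour-difference statistic; none of it from the paper) demonstrates that a face rendered at low resolution and upscaled has far less high-frequency energy than camera-native content.

```python
import numpy as np

rng = np.random.default_rng(0)

def hf_energy(img):
    """Crude high-frequency statistic: mean squared difference between neighbours."""
    return np.mean((img - np.roll(img, 1, 0)) ** 2 +
                   (img - np.roll(img, 1, 1)) ** 2)

native = rng.normal(size=(64, 64))   # stand-in for a camera-native face region

# Deepfake-style pipeline: render small, then upscale to the target size.
low = native.reshape(16, 4, 16, 4).mean(axis=(1, 3))        # 4x block-mean downscale
upscaled = np.repeat(np.repeat(low, 4, axis=0), 4, axis=1)  # nearest-neighbour upscale

e_native, e_warped = hf_energy(native), hf_energy(upscaled)
```

A CNN trained on such composites effectively learns a far more discriminative version of this resolution-mismatch cue, localized to the pasted face region.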

Mesoscopic analysis

Researchers from the National Institute of Informatics in Tokyo have demonstrated that deep neural networks engineered with a deliberately low layer count are capable of detecting, at high computational efficiency, the minute discrepancies observed in deepfakes, achieving an accuracy of roughly 90%. The performance of these simple networks relies on the fact that a low layer count favors the identification of smaller, more elementary patterns, with a reduced number of convolutions of the image per forward pass.
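The "low layer count sees small patterns" intuition can be quantified with the standard receptive-field recurrence for stacked convolutions. The layer configurations below are illustrative, not the networks from the paper.

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions, given (kernel, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s              # stride compounds the spacing between outputs
    return rf

rf_shallow = receptive_field([(3, 1), (3, 1)])   # a shallow, mesoscopic-style stack
rf_deep = receptive_field([(3, 2)] * 4)          # a deeper, strided stack
```

A two-layer 3x3 stack sees only a 5x5 patch of the input, while four strided layers already see a 31x31 patch: shallow networks are structurally biased toward the small, local texture discrepancies that betray deepfakes.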

Pose estimation

Researchers at the University at Albany have demonstrated that systematic differences between simple deepfake outputs and the target face's pose can be detected using landmark points and classified using simple support vector machine models. However, the model's performance was found to degrade with blurrier images, as landmark assignment is similarly degraded.
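The underlying pose-inconsistency cue can be illustrated with synthetic 3-D landmarks: estimate one head pose from outline landmarks and another from central-face landmarks, and compare the two rotations. This sketch uses the classic Kabsch algorithm and reads the angle directly instead of feeding features to an SVM as the researchers did; the landmark sets and rotation angles are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1.0]])

def kabsch(P, Q):
    """Best-fit rotation R with q ~ R p for centred point sets (Kabsch)."""
    P, Q = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

def angle_between(Ra, Rb):
    """Rotation angle separating two rotation matrices, in radians."""
    return np.arccos(np.clip((np.trace(Ra.T @ Rb) - 1) / 2, -1, 1))

template = rng.normal(size=(20, 3))             # toy 3-D landmark template
outer, inner = template[:10], template[10:]     # head outline vs central face

R_head = rot_z(0.4)

# Genuine frame: the whole head shares a single pose.
genuine = angle_between(kabsch(outer, outer @ R_head.T),
                        kabsch(inner, inner @ R_head.T))

# Spliced frame: the pasted central face carries a slightly different pose.
R_face = rot_z(0.4 + 0.15)
spliced = angle_between(kabsch(outer, outer @ R_head.T),
                        kabsch(inner, inner @ R_face.T))
```

For a genuine face the two estimates agree almost exactly, while the spliced face exposes the pose offset introduced by the compositing step; it is this discrepancy feature that the SVM classifies.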

Proactive Authentication

An alternative and somewhat radical approach is to counter misinformation with more information, by enforcing validation at the point of data creation. Point-of-origin authentication approaches have been proposed, including internal hash-based image metadata as well as blockchain-based content validation. A more radical approach, an artificial “authenticated alibi”, was floated by AI law specialist Danielle Citron, whereby a public figure would constantly upload their location together with a livestream for verification. Naturally, constant streaming would be extremely strenuous on one's mental health, while also being excessively dystopian.
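A minimal sketch of the hash-based metadata idea, using only Python's standard library: each frame's digest is chained to the previous one, so a head value registered at capture time commits to the entire sequence. The frame contents and "genesis" seed here are placeholders; real proposals add signing keys and trusted timestamping on top of this core.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def chain_head(frames):
    """Hash-chain a sequence of frames: each link commits to the previous
    one, so altering any frame changes the final (published) head."""
    head = digest(b"genesis")
    for frame in frames:
        head = digest(head.encode() + digest(frame).encode())
    return head

frames = [f"frame-{i}".encode() for i in range(5)]  # stand-in for video frames

published = chain_head(frames)       # value registered at capture time

tampered = list(frames)
tampered[2] = b"frame-2-with-a-deepfaked-face"

ok_original = chain_head(frames) == published      # untouched video verifies
ok_tampered = chain_head(tampered) == published    # any edit breaks the chain
```

Verification then reduces to recomputing the chain and comparing against the published head, whether that head lives in file metadata or on a blockchain.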

Conclusions

While it may seem that we're in a new information cold war, with multiple actors and no SALT treaties to regulate arsenals, the rise of deepfakes is not an entirely morbid phenomenon. A combination of deepfakes and advanced NLP models such as OpenAI's GPT-2 could open the way to more human-like digital assistants, with applications in the service and healthcare industries. Law-enforcement and first-response personnel could be trained cost-effectively in completely procedurally generated scenarios in virtual reality.

In the long term, we could potentially develop the capability to upload a mimicry of our consciousness to the cloud, creating interactive avatars that persist long after we've passed on, without the need for real-life actors. To this author, it is both frightening and exciting that the foundations of such technologies are within our grasp. Early implementations of such ideas have already been demonstrated by Samsung's AI team, in the form of “living” portraits of famous individuals such as Marilyn Monroe and Albert Einstein.

Reliving the past: insightful or disrespectful? That's up to us to decide.

As a society, we stand at a new crossroads for truth. In the end, the solution to deepfakes may rest on societal adaptation: whether we can develop an enhanced capability to critically scrutinize, process, and evaluate information in an age of information overload, decentralized sources, and decreased attention spans.

In our next article, we’ll cover how to generate a simple deepfake in detail. To stay updated, please consider subscribing to GradientCrescent!

References

Rodriguez et al., Detection of Deepfake Video Manipulation, University of Amsterdam & Netherlands Forensic Institute

Guera et al., Deepfake Video Detection Using Recurrent Neural Networks, Purdue University

Li et al., Exposing DeepFake Videos By Detecting Face Warping Artifacts, University at Albany

Ronit Chawla, Deepfakes: How a pervert shook the world, Delhi Modern School

Yang et al., Exposing Deep Fakes Using Inconsistent Head Poses, University at Albany

Kim et al., Deep Video Portraits, Stanford University