A Google Brain team led by “Godfather of Deep Learning” Geoffrey Hinton has proposed a new way to accurately detect black box and white box FGSM and BIM attacks.

DARCCC (Detecting Adversaries by Reconstruction from Class Conditional Capsules) is a technique which uses a similarity metric to compare reconstructed images with an original input image to identify whether it was an adversarial image, and further detects whether the system was attacked.

An adversarial image attack involves intentionally deceiving an image recognition engine. A typical example is shown below, where the attacker adds an imperceptible vector to the input image, which fools the image classifier into recognizing a panda as a gibbon ape.

from the Ian Goodfellow et al paper Explaining and Harnessing Adversarial Examples

The adversarial attack itself is subtle and difficult to detect, but the technique also has weaknesses. In the above panda-gibbon classification example, even though the system classifies the panda image as a gibbon, the image still does not look like a gibbon. Google Brain researchers exploited this difference to sort out wrongly classified images.

When the model is recognizing images, it will output a “reconstruction image” in addition to the classification label (e.g. “panda” or “gibbon”). The reconstruction image is class conditional and so will reflect features of the classification label. If adversarial images have “fooled” the initial classification system, the reconstruction image will appear more dissimilar to the input than in the case of real images.

Reconstruction images differ more from the input when under attack

To apply the DARCCC technique to an image recognition model, a reconstruction error threshold needs to be defined for the validation set. Whenever an image’s reconstruction error surpasses the threshold it will be marked as an adversarial image. Accordingly, the algorithm can then determine whether the classification system has been attacked.

Histogram showing distances between the reconstruction and the input, for real and adversarial data for MNIST

The authors extended this attack detection technique to three image classification models (Capsule, CNN+R, and Masked CNN+R), and selected three image datasets (MNIST, Fashion-MNIST, and SVHN) for validation.

Three common white-box attack methods were tested, and it was found that the DARCCC resisted both FGSM (Fast Gradient Sign Method) and BIM (Basic Iterative Methods) algorithms in white-box attacks. DARCCC however was defeated by the more powerful R-BIM (Reconstructive BIM) attacks, which can calculate reconstruction loss and attack the model iteratively.

The paper DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules will be presented at the NeurIPS 2018 Workshop on Security next month in Montréal. Synced will be reporting from the conference throughout the week.