The previous paper finds tiny perturbations spread across the entire image that make the network misclassify it. The authors of this paper take it further: they argue that modifying the entire image is not required.

Instead, they modify only a small portion of the image such that the modified image is predicted as the wrong class.

Formally, for a given input x, the probability of x belonging to class t is f_t(x). The task at hand is denoted in Equation 2.0.

Equation 2.0:  $$\max_{e(x)} \; f_{adv}\big(x + e(x)\big)$$

where adv is the adversarial class to optimise for.

Here, e(x) is the more interesting term. It is the adversarial perturbation (similar to the one in the previous paper) that is added to the input. However, in this case, e(x) has the following constraint:

Equation 2.1:  $$\|e(x)\|_0 \le L$$

This just means that the number of non-zero elements in e(x) has to be at most L, which is a tuneable parameter. (The $\|\cdot\|_0$ denotes the L0 norm, which counts the number of non-zero elements in a vector.) The maximum value of the elements produced by e(x) is also constrained, similar to the previous paper.
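As a quick illustration (not taken from the paper), here is what that sparsity constraint looks like in NumPy for a hypothetical CIFAR-10-sized perturbation:

```python
import numpy as np

# Hypothetical perturbation e(x): same shape as a CIFAR-10 image,
# zero everywhere except the pixels we are allowed to change.
e = np.zeros((32, 32, 3))
e[5, 12] = [120, -60, 30]   # modify a single pixel, so L = 1 is respected

# The "L0 norm" here counts how many pixels were modified.
num_modified_pixels = np.count_nonzero(np.any(e != 0, axis=-1))
assert num_modified_pixels <= 1
```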

How to find the right adversarial vector

The previous paper used backprop in order to optimize for the right values of the adversarial input. In my opinion, giving access to the model's gradients is unfair, as it's essentially possible to know exactly how the model 'thinks'. Therefore, optimizing for the adversarial inputs becomes easy.

In this paper, the authors decided not to use gradients. Instead, they used Differential Evolution. It is a method that keeps a population of candidate solutions and generates 'children' from them. From those children, it keeps only the ones that are better than their parents. The process then repeats, generating new children from the surviving population.

This does not require any information about gradients, so finding the right values for the adversarial input can be done without any knowledge of how the model works internally (it will even work if the model is not differentiable, unlike the previous method). A rough sketch of this search loop is shown below.
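To make the idea concrete, here is a minimal sketch of how such a search might look. It is not the authors' implementation: `predict_fn`, the population size, the mutation factor, and the (x, y, r, g, b) pixel encoding are assumptions for illustration, and the usual crossover step of differential evolution is omitted for brevity.

```python
import numpy as np

def one_pixel_attack_de(image, predict_fn, adv_class,
                        pop_size=50, iters=100, mutation=0.5):
    """Sketch of a differential-evolution search for a one-pixel perturbation.

    `predict_fn(img)` is assumed to return a vector of class probabilities
    for a single image; the model is treated as a black box (no gradients).
    Each candidate encodes one pixel as (x, y, r, g, b).
    """
    h, w, _ = image.shape

    def apply_candidate(c):
        x, y, r, g, b = c
        adv = image.copy()
        adv[int(y) % h, int(x) % w] = np.clip([r, g, b], 0, 255)
        return adv

    def fitness(c):
        # Higher probability of the adversarial class is better.
        return predict_fn(apply_candidate(c))[adv_class]

    # Random initial population of candidate pixels.
    bounds = np.array([w, h, 255, 255, 255])
    pop = np.random.rand(pop_size, 5) * bounds
    scores = np.array([fitness(c) for c in pop])

    for _ in range(iters):
        for i in range(pop_size):
            # Generate a child by mixing three random parents (DE/rand/1).
            a, b, c = pop[np.random.choice(pop_size, 3, replace=False)]
            child = a + mutation * (b - c)
            child_score = fitness(child)
            # Keep the child only if it beats the current member.
            if child_score > scores[i]:
                pop[i], scores[i] = child, child_score

    # Return the adversarial image built from the best surviving candidate.
    return apply_candidate(pop[np.argmax(scores)])
```

Note that every step only queries the model for its output probabilities, which is exactly why no gradients (and no differentiability) are needed.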

These are the results reported by the authors:

Figure 2.0 One pixel attack results. Only one pixel is modified to make the classifier output a wrong class. The class in parentheses is the classifier's output after the noise was added.

For CIFAR-10, the value of L was set to 1, which means only one pixel was allowed to be modified. Even so, the adversarial attack was able to fool the classifier into predicting very different classes.