Adversarial attacks are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake.

Let’s take an example.

Let’s say I have created an image classifier that gives the name of the object shown in an image. So if we give this as the input image

Then we will get the output “Panda”, which is correct.

But what if I told you that just by adding some specific noise to the image, I can fool the classifier into thinking it is a different object? Such as

This is a classic example of an adversarial attack.
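One well-known way to craft this kind of noise is the Fast Gradient Sign Method (FGSM): nudge every pixel slightly in the direction that increases the model’s loss. The example above doesn’t name the method it used, so take this as a minimal sketch rather than a recipe for that exact image; the toy model, random input, and epsilon value below are placeholders for a real trained classifier.

```python
# Minimal FGSM sketch: perturb each pixel in the direction of the loss gradient.
# The model and "image" here are stand-ins; use a real trained network in practice.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
model.eval()

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # stand-in for the panda image
true_label = torch.tensor([0])
epsilon = 0.01  # perturbation budget: small enough to be hard for a human to notice

# Gradient of the loss with respect to the input pixels.
loss = nn.functional.cross_entropy(model(image), true_label)
loss.backward()

# Step every pixel slightly in the direction that increases the loss.
adversarial_image = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
print(model(adversarial_image).argmax(dim=1))  # may no longer be the true class
```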

Adversarial examples have the potential to be dangerous. For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a ‘yield’ or other sign, as discussed in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples.

Reinforcement learning agents can also be manipulated by adversarial examples, according to research on Adversarial Attacks on Neural Network Policies. The research shows that widely used RL algorithms, such as DQN, TRPO, and A3C, are vulnerable to adversarial inputs. These can degrade performance even when the perturbations are too subtle for a human to perceive, causing an agent to move a Pong paddle down when it should go up, or interfering with its ability to spot enemies in Seaquest.

When we think about the study of AI safety, we usually think about some of the most difficult problems in that field — how can we ensure that sophisticated reinforcement learning agents that are significantly more intelligent than human beings behave in ways that their designers intended?

Adversarial examples show us that even simple modern algorithms, for both supervised and reinforcement learning, can already behave in surprising ways that we do not intend.

Adversarial examples are hard to defend against because it is difficult to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many ML models, including neural networks. Because we don’t have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense will rule out a set of adversarial examples.
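A standard way to write that optimization problem down (this formulation is a common one, not quoted from any of the papers above) is: within a small perturbation budget, find the noise that most increases the model’s loss,

$$\max_{\|\delta\|_\infty \le \epsilon} \; L\big(f(x + \delta),\, y\big)$$

where $x$ is the original input, $y$ its correct label, $f$ the model, $L$ the training loss, and $\epsilon$ the size of the allowed perturbation. Because $f$ is a deep network, this objective is non-linear and non-convex, which is exactly why the solutions are so hard to characterize.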

Adversarial examples are also hard to defend against because they require machine learning models to produce good outputs for every possible input. Most of the time, machine learning models work very well, but only on a very small fraction of all the possible inputs they might encounter.

Adversarial examples show that many modern machine learning algorithms can be broken in surprising ways. These failures of machine learning demonstrate that even simple algorithms can behave very differently from what their designers intend.

Okay, let me ask you a question:

So how many pixels do we need to change to fool our neural network?

Unfortunately, the answer is one.

In this paper, the authors show that a neural network can often be fooled by changing just one pixel of the input image.

By changing only one pixel in an image that depicts a horse, the AI can be made 99.9% sure that it is looking at a frog. A ship can also be disguised as a car, or, amusingly, almost anything can be made to look like an airplane.

So how can we perform such an attack? As you can see, these neural networks typically don’t output a class directly, but a set of confidence values. What does this mean exactly?

The confidence values denote how sure the network is that it sees, say, a Labrador or a tiger cat. To come to a decision, we usually look at all of these confidence values and choose the class with the highest confidence. Now, to perform a successful attack, we clearly have to know which pixel position to choose and what color it should be. We can do this by making a bunch of random changes to the image and checking how well each change decreases the network’s confidence in the correct class.

After this, we filter out the bad candidates and continue our search around the most promising ones. This process is called differential evolution, and if we perform it properly, in the end the confidence value for the correct class will be so low that a different class takes over. If this happens, the network has been defeated.
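To make this concrete, here is a minimal sketch of a one-pixel attack driven by SciPy’s differential evolution. The random linear “model” is only a stand-in for the real network, and the image size, bounds, and iteration counts are illustrative assumptions; in a real attack you would query the target classifier’s confidence values instead.

```python
# One-pixel attack sketch: search over (x, y, r, g, b) with differential evolution,
# minimizing the network's confidence in the true class.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
W = rng.normal(size=(32 * 32 * 3, 10))   # toy classifier weights
image = rng.random((32, 32, 3))          # stand-in for the horse image
true_class = 7

def confidences(img):
    """Return softmax confidence values for the 10 classes."""
    logits = img.reshape(-1) @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def fitness(params):
    """Confidence in the true class after one pixel is overwritten (lower is better)."""
    x, y, r, g, b = params
    perturbed = image.copy()
    perturbed[int(x), int(y)] = (r, g, b)
    return confidences(perturbed)[true_class]

# Each candidate is a pixel position plus its new colour.
bounds = [(0, 31), (0, 31), (0, 1), (0, 1), (0, 1)]
result = differential_evolution(fitness, bounds, maxiter=75, popsize=20, seed=0)

adversarial = image.copy()
x, y, r, g, b = result.x
adversarial[int(x), int(y)] = (r, g, b)
print("predicted class:", confidences(adversarial).argmax())  # ideally no longer 7
```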

Now, note that this also means that we have to be able to query the neural network and have access to its confidence values. There is also plenty of research on training more robust neural networks that can withstand as many adversarial changes to their inputs as possible.
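One of the most common of those approaches is adversarial training: generate adversarial examples during training and mix them into the training data. Here is a minimal sketch assuming a PyTorch model and an FGSM-style perturbation; the model, random data, and epsilon are placeholders, not a recipe from any particular paper.

```python
# Adversarial training sketch: at every step, craft FGSM examples from the
# current batch and train on them alongside the clean examples.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epsilon = 0.01

for step in range(100):
    images = torch.rand(8, 3, 32, 32)        # stand-in for a real data loader
    labels = torch.randint(0, 10, (8,))

    # Craft adversarial versions of this batch with one FGSM step.
    images.requires_grad_(True)
    loss = nn.functional.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    adv_images = (images + epsilon * grad.sign()).clamp(0, 1).detach()

    # Train on the clean and adversarial examples together.
    optimizer.zero_grad()
    batch = torch.cat([images.detach(), adv_images])
    targets = torch.cat([labels, labels])
    nn.functional.cross_entropy(model(batch), targets).backward()
    optimizer.step()
```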