Simply put, “adversarial examples” are inputs with small modifications designed to fool an image classifier. Although modern neural networks can be trained to defend against common adversarial distortions, such as Lp-bounded distortions, they can still be fooled fairly easily by attacks that use other types of input distortion.
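
To make "Lp-bounded" concrete, the following minimal sketch measures how large a perturbation is under a chosen Lp norm and checks it against a budget. The function names `lp_norm` and `is_within_budget` and the flat-list image representation are illustrative assumptions, not part of the researchers' code.

```python
def lp_norm(delta, p):
    """Return the Lp norm of a flat list of per-pixel changes.

    p may be any positive number, or float("inf") for the max norm.
    """
    if p == float("inf"):
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)


def is_within_budget(original, perturbed, p, epsilon):
    """Check whether the perturbation stays inside an Lp ball of radius epsilon.

    `original` and `perturbed` are hypothetical flat lists of pixel values.
    """
    if len(original) != len(perturbed):
        raise ValueError("images must have the same number of pixels")
    delta = [b - a for a, b in zip(original, perturbed)]
    return lp_norm(delta, p) <= epsilon
```

A defense trained only against perturbations that pass such a check for one norm and one budget has no guarantee against distortions that fall outside that set, which is the gap the study examines.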

To ramp up the robustness of neural networks, researchers from OpenAI have introduced a novel method that evaluates how well a neural network classifier performs against adversarial attacks that were not seen during its training. The study compares a classifier's performance with that of a robust defense benchmark that has knowledge of the distortion type, and introduces a summary metric called UAR (Unforeseen Attack Robustness), which measures robustness against such unforeseen distortion attacks.

The proposed three-step method evaluates a network’s performance when faced with unforeseen distortion attacks in a range of types and sizes:

Step one: Evaluate against diverse unforeseen distortion types. Researchers propose that evaluating typical adversarial defense systems on oft-studied Lp distortions does not sufficiently reflect their robustness against other adversarial attacks, and so introduced L1, L2-JPEG, Elastic, and Fog attacks into their study.

Sample images (espresso maker) of the same strong attack applied to different defense models. Attacking stronger defenses causes greater visual distortions.

Step two: Choose a wide range of distortion sizes calibrated against strong models. Because too narrow a range of distortion sizes can compromise conclusions regarding robustness, researchers chose recognizable attack images with the widest possible range of distortion sizes.
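
As a rough illustration of covering a wide range of distortion sizes, the sketch below spaces sizes geometrically between a smallest and a largest value. The geometric spacing, the function name `geometric_sizes`, and the endpoint values in the usage note are assumptions made for this example. The article does not specify how the researchers calibrated their sizes.

```python
def geometric_sizes(eps_min, eps_max, count):
    """Return `count` distortion sizes spaced geometrically from eps_min to eps_max.

    Geometric spacing is an assumption for illustration. It spreads sizes
    evenly on a log scale, so small and large distortions both get coverage.
    """
    if count < 2:
        raise ValueError("count must be at least 2")
    if not 0 < eps_min < eps_max:
        raise ValueError("require 0 < eps_min < eps_max")
    ratio = (eps_max / eps_min) ** (1.0 / (count - 1))
    return [eps_min * ratio ** k for k in range(count)]
```

For example, `geometric_sizes(1.0, 32.0, 6)` yields six sizes that roughly double at each step. In practice the endpoints would be calibrated per attack so that the largest size still produces recognizable images.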

UAR scores for adversarially trained models against adversarial attacks with different distortion types. A UAR score near 100 against an unforeseen adversarial attack implies performance comparable to a defense with prior knowledge of the attack, making this a challenging objective.

Step three: Benchmark adversarial robustness against adversarially trained models. Researchers compute a model's UAR score from its average accuracy against an attack across multiple distortion sizes, normalized by the accuracy of models that were adversarially trained against that same attack.
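
The normalization described in step three can be sketched as follows. This is a minimal illustration, assuming UAR is 100 times the ratio of the evaluated model's summed accuracy to the summed accuracy of adversarially trained reference models, one per distortion size. The function name `uar_score` and the accuracy values in the usage note are hypothetical.

```python
def uar_score(model_accuracies, reference_accuracies):
    """Compute a UAR-style score for one attack.

    model_accuracies[k]: accuracy of the evaluated model against the attack
        at the k-th distortion size.
    reference_accuracies[k]: accuracy, at the same size, of a model that was
        adversarially trained against this attack (the informed defense).
    Both lists must use the same scale (for example, percent).
    Assumed formula: 100 * sum(model) / sum(reference).
    """
    if not model_accuracies or len(model_accuracies) != len(reference_accuracies):
        raise ValueError("need two non-empty lists of equal length")
    reference_total = sum(reference_accuracies)
    if reference_total <= 0:
        raise ValueError("reference accuracies must sum to a positive value")
    return 100.0 * sum(model_accuracies) / reference_total
```

With made-up accuracies, `uar_score([40, 30, 20, 10], [80, 60, 40, 20])` gives 50.0: the model retains half the accuracy of defenses that knew the attack in advance. A score near 100 would mean it matches them.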

The study’s results illustrate the challenges and limitations involved in training against adversarial examples. Researchers conclude that the robustness a model gains through the adversarial training process “does not transfer broadly to unforeseen distortions.” Further, increasing robustness against known distortions can actually reduce robustness against unforeseen distortions, suggesting the need to either modify or move beyond today’s adversarial training techniques.

Researchers believe this methodology could be expanded to help evaluate model robustness against a more diverse set of unforeseen attacks.

The researchers’ code package, which includes a suite of attacks, adversarially trained models, and calibrations that allow UAR to be easily computed, has been open-sourced on GitHub. The paper Testing Robustness Against Unforeseen Adversaries is on arXiv.