A New Angle on L2 Regularization

Imagine two high dimensional clusters and a hyperplane separating them. Consider in particular the angle between the direction joining the two clusters' centroids and the normal to the hyperplane. In linear classification, this angle depends on the level of L2 regularization used.

— Can you explain why? —

Deep neural networks have been shown to be vulnerable to the adversarial example phenomenon: all models tested so far can have their classifications dramatically altered by small image perturbations. The following predictions were for instance made by a state-of-the-art network trained to recognize celebrities:

This result is puzzling for two reasons. First, it challenges a common belief according to which good generalization to novel data and robustness to small perturbations go hand in hand. Second, it constitutes a potential threat to real-world applications. Researchers at MIT have for instance recently constructed 3D objects that are misclassified under a wide distribution of angles and viewpoints. Understanding this phenomenon and improving deep networks' robustness has thus become an important research objective.

Several approaches have been explored already. The phenomenon has been described in detail and some theoretical analysis has been provided. Attempts have been made at designing more robust architectures or at detecting adversarial examples during evaluation. Adversarial training has also been introduced as a new regularization technique penalizing adversarial directions. Unfortunately, the problem remains largely unresolved. Confronted with this difficulty, we propose to proceed from fundamentals: focusing on linear classification first and then increasing complexity incrementally.

A Toy Problem

In linear classification, adversarial perturbations are often understood as a property of the dot product in high dimension. A widespread intuition is that: "for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output". Here, we challenge this intuition and argue instead that adversarial examples exist when the classification boundary lies close to the data manifold, independently of the image space dimension. This idea was originally inspired by the work of Marron et al. on the data piling phenomenon affecting SVMs on high-dimensional low sample size data.

Setup

Let's start with a minimal toy problem: a two-dimensional image space where each image is a function of $a$ and $b$.

In this simple image space, we define two classes of images...

...which can be separated by an infinite number of linear classifiers. Consider for instance the line $\mathscr{L}_{\theta}$.

This raises a first question: if all the linear classifiers $\mathscr{L}_{\theta}$ separate $I$ and $J$ equally well, are they all equally robust to image perturbations?

Projected and mirror images

Consider an image $\boldsymbol{x}$ in class $I$. The closest image classified in the opposite class is the projected image of $\boldsymbol{x}$ on $\mathscr{L}_{\theta}$:

$$\boldsymbol{x}_p := \boldsymbol{x} - (\boldsymbol{x} \!\cdot\! \boldsymbol{w_{\theta}}) \; \boldsymbol{w_{\theta}} \quad\quad \text{(low-confidence adversarial example)}$$

When $\boldsymbol{x}$ and $\boldsymbol{x}_p$ are very close to each other, we say that $\boldsymbol{x}_p$ is an adversarial example of $\boldsymbol{x}$. Observe though that $\boldsymbol{x}_p$ is classified with a low confidence score (it lies on the boundary) and it is perhaps more interesting to consider high-confidence adversarial examples.

In the following, we focus on the mirror image of $\boldsymbol{x}$ through $\mathscr{L}_{\theta}$:

$$\boldsymbol{x}_m := \boldsymbol{x} - 2\,(\boldsymbol{x} \!\cdot\! \boldsymbol{w_{\theta}}) \; \boldsymbol{w_{\theta}} \quad\quad \text{(high-confidence adversarial example)}$$

By construction, $\boldsymbol{x}$ and $\boldsymbol{x}_m$ are at the same distance from the boundary and are classified with the same confidence level.
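The two constructions are easy to check numerically. Below is a minimal NumPy sketch, assuming (as in the toy problem) a boundary through the origin with unit normal $\boldsymbol{w_{\theta}}$:

```python
import numpy as np

def project(x, w):
    """Projected image: closest point to x on the boundary {z : z . w = 0}."""
    return x - (x @ w) * w

def mirror(x, w):
    """Mirror image: reflection of x through the boundary."""
    return x - 2 * (x @ w) * w

w = np.array([1.0, 1.0]) / np.sqrt(2)   # unit normal to the boundary
x = np.array([2.0, 1.0])

x_p = project(x, w)
x_m = mirror(x, w)

# x_p lies on the boundary; x and x_m have opposite signed distances to it.
print(x_p @ w)           # ~0
print(x @ w, x_m @ w)    # same magnitude, opposite signs
```

The last line illustrates the claim above: $\boldsymbol{x}$ and $\boldsymbol{x}_m$ are at the same distance from the boundary, on opposite sides.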

A mirror image as a function of $\theta$

Coming back to our toy problem, we can now plot an image $\boldsymbol{x}$ and its mirror image $\boldsymbol{x}_m$ as a function of $\theta$.

We see that the distance between $\boldsymbol{x}$ and $\boldsymbol{x}_m$ depends on the angle $\theta$. The two borderline cases are of particular interest.

When $\theta = 0$: $\mathscr{L}_{\theta}$ does not suffer from adversarial examples. $\boldsymbol{x}$ is classified in $I$ with high confidence and $\boldsymbol{x}_m$ is classified in $J$ with high confidence, in agreement with human observers.

When $\theta \to \pi/2$: $\mathscr{L}_{\theta}$ suffers from strong adversarial examples. $\boldsymbol{x}$ is still classified in $I$ with high confidence and $\boldsymbol{x}_m$ in $J$ with high confidence, yet $\boldsymbol{x}_m$ is visually indistinguishable from $\boldsymbol{x}$.

This raises a second question: if adversarial examples exist when $\mathscr{L}_{\theta}$ is strongly tilted, what makes $\mathscr{L}_{\theta}$ tilt in practice?

Overfitting and L2 regularization

Our working hypothesis is that the classification boundary defined by standard linear learning algorithms such as Support Vector Machines (SVM) or logistic regression tilts by overfitting noisy data points in the training set. This hypothesis is supported by the theoretical result of Xu et al. relating robustness to regularization in SVMs. It can also be tested experimentally: techniques designed to reduce overfitting such as L2 regularization are expected to mitigate the adversarial example phenomenon.

Consider for instance a training set containing one noisy data point $\boldsymbol{p}$.

If we train an SVM or a logistic regression model on this training set, we observe two possible behaviours.

Without L2 regularization: The classification boundary is strongly tilted. Most of the leeway available to fit the training data resides in the tilting angle of the boundary. Here, the data point $\boldsymbol{p}$ can be classified correctly, but the classifier obtained is then vulnerable to adversarial examples. With L2 regularization: The classification boundary is not tilted. L2 regularization reduces overfitting by allowing some training samples to be misclassified. When enough regularization is used, the data point $\boldsymbol{p}$ is ignored and the classifier obtained is robust to adversarial examples.
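This behaviour can be reproduced in a few lines. The sketch below uses made-up toy data (the data points and $\lambda$ values are illustrative assumptions, not the article's figures) and plain gradient descent on the softplus loss, with and without an L2 penalty: on separable data the unregularized weight norm grows unchecked as the boundary adjusts to fit the noisy point $p$, while the penalized norm stays small.

```python
import numpy as np

# Toy 2D training set (made-up values): class I around (-2, 0),
# class J around (2, 0), plus one noisy point p labeled J.
X = np.array([[-2.0, 0.0], [-2.5, 0.5], [-1.5, -0.5],   # class I
              [ 2.0, 0.0], [ 2.5, 0.5], [ 1.5, -0.5],   # class J
              [-2.0, 1.5]])                              # noisy point p
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0])

def train(lam, iters=20000, lr=0.1):
    """Gradient descent on the softplus loss plus an L2 penalty lam * ||w||^2."""
    w, b = np.zeros(2), 0.0
    for _ in range(iters):
        s = X @ w + b
        g = -y / (1.0 + np.exp(np.clip(y * s, -500, 500)))  # d/ds ln(1+e^{-ys})
        w -= lr * ((X.T @ g) / len(y) + 2.0 * lam * w)
        b -= lr * g.mean()
    return w, b

w_free, b_free = train(lam=0.0)   # no regularization: free to overfit p
w_reg, b_reg = train(lam=0.5)     # strong L2 regularization

# The unregularized norm grows unchecked; the regularized one stays small,
# and the regularized boundary typically stays close to the centroid direction.
print(np.linalg.norm(w_free), np.linalg.norm(w_reg))
```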

At this point, one might legitimately wonder—what does a 1-dimensional data manifold lying in a 2-dimensional image space have to do with high-dimensional natural images?

Adversarial Examples in Linear Classification

In the following, we show that the two main ideas introduced in the previous toy problem stay valid in the general case: adversarial examples exist when the classification boundary lies close to the data manifold and L2 regularization controls the tilting angle of the boundary.

Scaling the Loss Function

Let's start with a simple observation: during training, the norm of the weight vector acts as a scaling parameter on the loss function.

Setup

Let $I$ and $J$ be two classes of images and $\mathcal{C}$ a hyperplane boundary defining a linear classifier in $\mathbb{R}^d$. $\mathcal{C}$ is specified by a normal weight vector $\boldsymbol{w}$ and a bias $b$. For an image $\boldsymbol{x}$ in $\mathbb{R}^d$, we call the raw score of $\boldsymbol{x}$ through $\mathcal{C}$ the value: $$s(\boldsymbol{x}) := \boldsymbol{w} \!\cdot\! \boldsymbol{x} + b$$ The raw score can be seen as a signed distance between $\boldsymbol{x}$ and the classification boundary defined by $\mathcal{C}$. In particular: $$\boldsymbol{x} \text{ is classified in } \begin{cases} I \quad & \text{if } s(\boldsymbol{x}) < 0 \\ J & \text{if } s(\boldsymbol{x}) \geq 0 \end{cases}$$

Now, consider a training set $T$ of $n$ pairs $(\boldsymbol{x},y)$ where $\boldsymbol{x}$ is an image and $y \in \{-1, 1\}$ is its label ($-1$ if $\boldsymbol{x} \in I$ and $1$ if $\boldsymbol{x} \in J$). We are interested in the distributions of the following quantities over $T$:

$s(\boldsymbol{x})$: if we plot the histograms of the raw scores over the training set, we typically get two clusters of opposite signs.

$y\,s(\boldsymbol{x})$: multiplying by the label allows us to distinguish the correctly classified data (positive values) from the misclassified data (negative values).

$f\big(y\,s(\boldsymbol{x})\big)$: we can then attribute a penalty to each training point $\boldsymbol{x}$ by applying a loss function $f$ to $y\,s(\boldsymbol{x})$.

This leads to the notion of empirical risk $R(\boldsymbol{w},b)$ for the classifier $\mathcal{C}$ defined as the average penalty over the training set $T$: $$R(\boldsymbol{w},b) := \frac{1}{n}\sum_{(\boldsymbol{x},y) \in T}f\big(y\,s(\boldsymbol{x})\big)$$ In general, learning a linear classifier consists of finding a weight vector $\boldsymbol{w}$ and a bias $b$ minimizing $R(\boldsymbol{w},b)$ for a well chosen loss function $f$.

In binary classification, three notable loss functions are:

The 0-1 indicator function: $$f:z \rightarrow \begin{cases} 1 \quad & \text{if } z \leq 0\\ 0 & \text{if } z \gt 0 \end{cases}$$

The hinge loss: $$f:z \rightarrow \max(1-z,0)$$

The softplus loss: $$f:z \rightarrow \ln\left(1+e^{-z}\right)$$

With the 0-1 indicator function, the empirical risk is simply the error rate on $T$. In a sense, this is the optimal loss function as minimizing the error rate is often the desired objective in practice. Unfortunately, it is incompatible with gradient descent (there is no gradient to descend: the derivative is zero almost everywhere).

This limitation is overcome in the hinge loss (used in SVM) and the softplus loss (used in logistic regression) by replacing the unit penalty on the misclassified data with a strictly decreasing penalty. Note that both the hinge loss and the softplus loss also penalize some correctly classified data in the neighbourhood of the boundary, effectively enforcing a safety margin.
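For reference, the three losses are one-liners in NumPy (a sketch, with $z$ standing for $y\,s(\boldsymbol{x})$):

```python
import numpy as np

def zero_one(z):
    """0-1 indicator: unit penalty on misclassified data (z <= 0)."""
    return np.where(z <= 0, 1.0, 0.0)

def hinge(z):
    """Hinge loss, used in SVM."""
    return np.maximum(1.0 - z, 0.0)

def softplus(z):
    """Softplus loss, used in logistic regression."""
    return np.log1p(np.exp(-z))

z = np.linspace(-2, 2, 5)
for f in (zero_one, hinge, softplus):
    print(f.__name__, f(z))
```

Note that both the hinge and the softplus loss are positive for small positive $z$, which is the safety margin mentioned above.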

The scaling parameter $\lVert\boldsymbol{w}\rVert$

An important point previously overlooked is that the signed distance $s(\boldsymbol{x})$ is scaled by the norm of the weight vector. If $d(\boldsymbol{x})$ is the actual signed Euclidean distance between $\boldsymbol{x}$ and $\mathcal{C}$, we have: $$d(\boldsymbol{x}):= \boldsymbol{\hat{w}} \!\cdot\! \boldsymbol{x} + b^\prime \quad\quad\quad \text{where} \quad\quad \textstyle \boldsymbol{\hat{w}} := \frac{\boldsymbol{w}}{\lVert\boldsymbol{w}\rVert}\,\quad b^\prime := \frac{b}{\lVert\boldsymbol{w}\rVert}$$ $$\text{and} \quad s(\boldsymbol{x}) = \lVert\boldsymbol{w}\rVert\,d(\boldsymbol{x})$$ Hence, the norm $\lVert\boldsymbol{w}\rVert$ can be interpreted as a scaling parameter for the loss function in the expression of the empirical risk:

$$R(\boldsymbol{w},b) = \frac{1}{n}\sum_{(\boldsymbol{x},y) \in T}f\big(\;\lVert\boldsymbol{w}\rVert \times y\,d(\boldsymbol{x})\big)$$

Let us define the scaled loss function $f_{\lVert\boldsymbol{w}\rVert}:z \rightarrow f(\lVert\boldsymbol{w}\rVert \times z)$.

We observe that the 0-1 indicator function is invariant to rescaling while the hinge loss and the softplus loss are strongly affected.

Influence of the scaling parameter $\lVert\boldsymbol{w}\rVert$ on the 0-1 indicator function, the hinge loss and the softplus loss. The histograms represent the values of $y\,d(\boldsymbol{x})$ over $T$ for a given $\boldsymbol{w}$ and $b$ and the blue lines represent the values of $f_{\lVert\boldsymbol{w}\rVert}$.
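This invariance is easy to check numerically. The sketch below redefines the three losses and compares $f_{\lVert\boldsymbol{w}\rVert}$ with $f$ for $\lVert\boldsymbol{w}\rVert = 10$:

```python
import numpy as np

def zero_one(z): return np.where(z <= 0, 1.0, 0.0)
def hinge(z):    return np.maximum(1.0 - z, 0.0)
def softplus(z): return np.log1p(np.exp(-z))

def scaled(f, w_norm):
    """Scaled loss f_{||w||} : z -> f(||w|| * z)."""
    return lambda z: f(w_norm * z)

z = np.array([-1.0, -0.1, 0.1, 1.0])   # signed distances y * d(x)

# The 0-1 indicator only depends on the sign of z: invariant to rescaling...
print(scaled(zero_one, 10.0)(z), zero_one(z))
# ...while the hinge loss changes shape under rescaling.
print(scaled(hinge, 10.0)(z), hinge(z))
```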

Remarkably, the hinge loss and the softplus loss behave in the same way for extreme values of the scaling parameter.

When $\lVert\boldsymbol{w}\rVert$ is large, the hinge loss and the softplus loss penalize only the misclassified data, and do so linearly. More precisely, both losses satisfy:

Hinge loss

\begin{align} \max(1-\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x}),0)& \quad\> \;=\; \quad\> \textstyle\lVert\boldsymbol{w}\rVert \, \max\left(\lVert\boldsymbol{w}\rVert^{-1}-y\,d(\boldsymbol{x}),0\right)\\ &\;\underset{\lVert\boldsymbol{w}\rVert \to +\infty}{\approx}\; \lVert\boldsymbol{w}\rVert \, \max\left(-y\,d(\boldsymbol{x}),0\right) \end{align}

Softplus loss

\begin{align} \ln\left(1+e^{-\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x})}\right) &\;\underset{\lVert\boldsymbol{w}\rVert \to +\infty}{\approx}\; \begin{cases} -\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x}) \quad & \text{if } y\,d(\boldsymbol{x}) \leq 0\\ 0 & \text{if } y\,d(\boldsymbol{x}) \gt 0 \end{cases}\\ &\;\underset{\lVert\boldsymbol{w}\rVert \to +\infty}{\approx}\; \lVert\boldsymbol{w}\rVert \, \max\left(-y\,d(\boldsymbol{x}),0\right) \end{align}

In both cases: $$f_{\lVert\boldsymbol{w}\rVert}\big(y\,d(\boldsymbol{x})\big) \,\underset{\lVert\boldsymbol{w}\rVert \to +\infty}{\approx}\, \lVert\boldsymbol{w}\rVert \, \max\left(-y\,d(\boldsymbol{x}),0\right)$$

For convenience, we name the set of misclassified data: $$M := \{ (\boldsymbol{x},y) \in T \;|\; y\,d(\boldsymbol{x}) \leq 0 \}$$ and we can then write the empirical risk as: $$R(\boldsymbol{w},b) \;\underset{\lVert\boldsymbol{w}\rVert \to +\infty}{\approx}\; \lVert\boldsymbol{w}\rVert \, \biggl(\frac{1}{n}\sum_{(\boldsymbol{x},y) \in M}\!\left(-\,y\,d(\boldsymbol{x})\right)\biggr)$$

This expression contains a term which we call the error distance: $$d_{\text{err}} := \frac{1}{n}\sum_{(\boldsymbol{x},y) \in M}\!\left(-\,y\,d(\boldsymbol{x})\right)$$ It is positive and can be interpreted as the average distance by which each training sample is misclassified by $\mathcal{C}$ (with a null contribution for the correctly classified data). It is related, although not exactly equivalent, to the training error: a small error distance $d_{\text{err}}$ does not guarantee that the training error $err_{\text{train}}$ is small. For instance in the worst case, when all the data lies exactly on the boundary, $d_{\text{err}} = 0$ while $err_{\text{train}}$ can reach $100\%$.

Finally we have: $$\text{minimize: }R(\boldsymbol{w},b) \;\underset{\lVert\boldsymbol{w}\rVert \to +\infty}{\iff}\; \text{minimize: $d_{\text{err}}$}$$ In words, when $\lVert\boldsymbol{w}\rVert$ is large, minimizing the empirical risk for the hinge loss or the softplus loss is equivalent to minimizing the error distance, which is similar to minimizing the error rate on the training set.
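Concretely, the error distance can be computed in one line from the signed distances $y\,d(\boldsymbol{x})$ (a minimal sketch with made-up distances):

```python
import numpy as np

def error_distance(yd):
    """d_err: average distance by which training samples are misclassified.
    yd holds the values y * d(x); correctly classified points contribute 0."""
    return np.sum(np.maximum(-yd, 0.0)) / len(yd)

# Example: 4 points, one misclassified by a distance of 0.8.
yd = np.array([1.2, 0.5, 2.0, -0.8])
print(error_distance(yd))  # 0.2
```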

When $\lVert\boldsymbol{w}\rVert$ is small, the hinge loss and the softplus loss penalize the entire training data linearly: $$f_{\lVert\boldsymbol{w}\rVert}\big(y\,d(\boldsymbol{x})\big) \,\underset{\lVert\boldsymbol{w}\rVert \to 0}{\approx}\, \alpha - \beta\;\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x})$$ for some positive values $\alpha$ and $\beta$. More precisely, both losses satisfy:

Hinge loss

$$\max(1-\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x}),0) \;\underset{\lVert\boldsymbol{w}\rVert \to 0}{=}\; 1-\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x}) \quad\quad \left(\alpha = 1 \;\text{ and }\; \beta = 1\right)$$

Softplus loss

$$\ln\left(1+e^{-\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x})}\right) \;\underset{\lVert\boldsymbol{w}\rVert \to 0}{\approx}\; \ln(2)-\frac{1}{2}\,\lVert\boldsymbol{w}\rVert\,y\,d(\boldsymbol{x}) \quad\quad \left(\alpha = \ln(2) \;\text{ and }\; \beta = \tfrac{1}{2}\right)$$

We can then write the empirical risk as: $$R(\boldsymbol{w},b) \;\underset{\lVert\boldsymbol{w}\rVert\ \to 0}{\approx}\; \alpha - \beta\;\lVert\boldsymbol{w}\rVert \, \biggl(\frac{1}{n}\sum_{(\boldsymbol{x},y) \in T}y\,d(\boldsymbol{x})\biggr)$$ This expression contains a term which we call the adversarial distance: $$d_{\text{adv}} := \frac{1}{n}\sum_{(\boldsymbol{x},y) \in T}y\,d(\boldsymbol{x})$$ It is the mean distance between the images in $T$ and the classification boundary $\mathcal{C}$ (with a negative contribution for the misclassified images). It can be viewed as a measure of robustness to adversarial perturbations: when $d_{\text{adv}}$ is high, the number of misclassified images is limited and the correctly classified images are far from $\mathcal{C}$.

Finally we have: $$\text{minimize: }R(\boldsymbol{w},b) \;\underset{\lVert\boldsymbol{w}\rVert \to 0}{\iff}\; \text{maximize: } d_{\text{adv}}$$ In words, when $\lVert\boldsymbol{w}\rVert$ is small, minimizing the empirical risk for the hinge loss or the softplus loss is equivalent to maximizing the adversarial distance, which can be interpreted as minimizing the phenomenon of adversarial examples.
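In the same spirit as the error distance, the adversarial distance is simply the mean of the signed distances (sketch with made-up values):

```python
import numpy as np

def adversarial_distance(yd):
    """d_adv: mean signed distance between training points and the boundary.
    Misclassified points (yd < 0) contribute negatively."""
    return np.mean(yd)

# Same made-up example: 4 points, one misclassified by 0.8.
yd = np.array([1.2, 0.5, 2.0, -0.8])
print(adversarial_distance(yd))  # 0.725
```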

Closing remarks

In practice, the value of $\lVert\boldsymbol{w}\rVert$ can be controlled by adding a regularization term to the empirical risk, yielding the regularized loss:

$$L(\boldsymbol{w},b) := R(\boldsymbol{w},b) \; + \; \lambda\,\lVert\boldsymbol{w}\rVert^2$$

A small regularization parameter $\lambda$ lets $\lVert\boldsymbol{w}\rVert$ grow unchecked while a larger $\lambda$ encourages $\lVert\boldsymbol{w}\rVert$ to shrink.

In summary, the two standard models used in linear classification (SVM and logistic regression) balance between two objectives: they minimize the error distance when regularization is low and they maximize the adversarial distance when regularization is high.

Adversarial Distance and Tilting Angle

The adversarial distance emerged in the previous section as a measure of robustness to adversarial perturbations. Rather conveniently, it can be expressed as a function of a single parameter: the angle between the classification boundary and the nearest centroid classifier.

If $T_I$ and $T_J$ are the restrictions of $T$ to the elements in $I$ and $J$ respectively, we can write: \begin{align} d_{\text{adv}} &= \frac{1}{n}\sum_{(\boldsymbol{x},y) \in T}y\,d(\boldsymbol{x})\\ &= \frac{1}{n}\,\biggl[\;\sum_{\boldsymbol{x} \in T_I}(-\boldsymbol{\hat{w}} \!\cdot\! \boldsymbol{x} - b^\prime) \;\;+\;\; \sum_{\boldsymbol{x} \in T_J}(\boldsymbol{\hat{w}} \!\cdot\! \boldsymbol{x} + b^\prime)\,\biggr] \end{align}

If $T_I$ and $T_J$ are balanced $\left(n = 2\,n_I = 2\,n_J\right)$: \begin{align} d_{\text{adv}} &= -\frac{1}{2\,n_I}\sum_{\boldsymbol{x} \in T_I}\boldsymbol{\hat{w}} \!\cdot\! \boldsymbol{x} \;\;+\;\; \frac{1}{2\,n_J}\sum_{\boldsymbol{x} \in T_J}\boldsymbol{\hat{w}} \!\cdot\! \boldsymbol{x}\\ &= \frac{1}{2}\;\boldsymbol{\hat{w}} \!\cdot\! \biggl[\biggl(\frac{1}{n_J} \sum_{\boldsymbol{x} \in T_J}\boldsymbol{x}\biggr) - \biggl(\frac{1}{n_I} \sum_{\boldsymbol{x} \in T_I}\boldsymbol{x}\biggr)\biggr] \end{align}

If $\boldsymbol{i}$ and $\boldsymbol{j}$ are the centroids of $T_I$ and $T_J$ respectively: $$d_{\text{adv}} = \frac{1}{2}\;\boldsymbol{\hat{w}} \!\cdot\! (\boldsymbol{j}-\boldsymbol{i})$$

We now introduce the nearest centroid classifier, which has unit normal vector $\boldsymbol{\hat{z}} = (\boldsymbol{j}-\boldsymbol{i})/\lVert\boldsymbol{j}-\boldsymbol{i}\rVert$: $$d_{\text{adv}} =\frac{1}{2}\;\lVert\boldsymbol{j}-\boldsymbol{i}\rVert\;\boldsymbol{\hat{w}} \!\cdot\! \boldsymbol{\hat{z}}$$

Finally, we call the plane containing $\boldsymbol{\hat{w}}$ and $\boldsymbol{\hat{z}}$ the tilting plane of $\mathcal{C}$ and we call the angle $\theta$ between $\boldsymbol{\hat{w}}$ and $\boldsymbol{\hat{z}}$ the tilting angle of $\mathcal{C}$: $$d_{\text{adv}} = \frac{1}{2}\;\lVert\boldsymbol{j}-\boldsymbol{i}\rVert\;\cos(\theta)$$

This equation can be interpreted geometrically in the tilting plane: $d_{\text{adv}}$ is the average distance between the classification boundary $\mathcal{C}$ and the two centroids $\boldsymbol{i}$ and $\boldsymbol{j}$.

On a given training set $T$ where the distance between the two centroids $\lVert\boldsymbol{j}-\boldsymbol{i}\rVert$ is fixed, $d_{\text{adv}}$ depends only on the tilting angle $\theta$. Two observations follow:



The adversarial example phenomenon is minimized by the nearest centroid classifier ($\theta = 0$) and by the classifiers parallel to it: $d_{\text{adv}}$ is independent of the bias $b$. This explains why the models defined by SVM are poorly adjusted when regularization is high, resulting in very high training and test errors (see for instance the classification of 1s vs 8s when $\lambda = 10^7$ in the next section).

Adversarial examples can be arbitrarily strong when $\theta \to \pi/2$ (as was the case with the classifier $\mathscr{L}_{\theta}$ in the toy problem section).
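The identity $d_{\text{adv}} = \frac{1}{2}\lVert\boldsymbol{j}-\boldsymbol{i}\rVert\;\boldsymbol{\hat{w}}\!\cdot\!\boldsymbol{\hat{z}}$ can be verified numerically on any balanced training set and any hyperplane (a sketch with made-up Gaussian clusters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # input dimension (made-up)
T_I = rng.normal(-1.0, 1.0, (20, d))     # 20 points of class I (y = -1)
T_J = rng.normal(+1.0, 1.0, (20, d))     # 20 points of class J (y = +1)

w = rng.normal(size=d)                   # arbitrary hyperplane (w, b)
b = 0.3
w_hat = w / np.linalg.norm(w)
b_prime = b / np.linalg.norm(w)

# Adversarial distance computed from its definition...
yd = np.concatenate([-(T_I @ w_hat + b_prime), T_J @ w_hat + b_prime])
d_adv = yd.mean()

# ...and from the centroid identity (the bias cancels on balanced classes).
i, j = T_I.mean(axis=0), T_J.mean(axis=0)
z_hat = (j - i) / np.linalg.norm(j - i)
identity = 0.5 * np.linalg.norm(j - i) * (w_hat @ z_hat)

print(d_adv, identity)   # equal up to floating point error
```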

Example: SVM on MNIST

We now illustrate the previous considerations on binary classification of MNIST digits. For each possible pair of digit classes, we train multiple SVM models $(\boldsymbol{w}, b)$ using a training set of $3000$ images per class. More precisely, for each pair of digit classes we train $81$ models with a regularization parameter $\lambda = 10^{\alpha}$, with $\alpha$ ranging from $-1$ to $7$ in steps of $0.1$ (i.e. $\lambda \in [10^{-1}, 10^7]$).

We start by plotting the distributions of the distances $y\,d(\boldsymbol{x})$ between the training data and the boundary as a function of the regularization parameter $\lambda$ (grey histograms). We superimpose the loss function $f_{\lVert\boldsymbol{w}\rVert}$ as scaled after the convergence of each model (blue line).

We see that the scaling of the hinge loss has a clear influence on the model obtained. Unfortunately, minimizing the training error and maximizing the adversarial distance are conflicting goals: $err_{\text{train}}$ is minimized when $\lambda$ is small and $d_{\text{adv}}$ is maximized when $\lambda$ is large. Note that the test error is minimized for an intermediate level of regularization $\lambda_{\text{optimal}}$. When $\lambda \lt \lambda_{\text{optimal}}$, the classifier is overfitted and when $\lambda \gt \lambda_{\text{optimal}}$, the classifier is underfitted.

To get a better understanding of how the two objectives are balanced, we can look at the training data under a different point of view.

We first compute the unit weight vector $\boldsymbol{\hat{z}}$ of the nearest centroid classifier. Then for each SVM model $(\boldsymbol{w}, b)$, we compute the unit vector $\boldsymbol{\hat{n}}$ such that $(\boldsymbol{\hat{z}},\boldsymbol{\hat{n}})$ is an orthonormal basis of the tilting plane of $\boldsymbol{w}$. We do this by using the Gram-Schmidt process: $$\boldsymbol{\hat{n}} = \frac{\boldsymbol{n}}{\lVert\boldsymbol{n}\rVert} \quad\quad \text{with} \quad\quad \boldsymbol{n} = \boldsymbol{w} - (\boldsymbol{w} \!\cdot\! \boldsymbol{\hat{z}})\,\boldsymbol{\hat{z}}$$ Finally, we project the training data in $(\boldsymbol{\hat{z}},\boldsymbol{\hat{n}})$:

The horizontal direction passes through the two centroids and the vertical direction is chosen such that $\boldsymbol{w}$ belongs to the plane (the hyperplane boundary then appears as a line). Remark also that since $(\boldsymbol{\hat{z}},\boldsymbol{\hat{n}})$ is an orthonormal basis, the distances in this plane are actual pixel distances. To understand why the data points appear to be moving around when $\lambda$ varies, one needs to imagine the tilting plane rotating around $\boldsymbol{\hat{z}}$ in the 784-dimensional input space (thus showing a different section of the 784-dimensional training data for each value of $\lambda$).
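The change of basis is a one-step Gram-Schmidt computation (sketch with made-up vectors standing in for $\boldsymbol{w}$ and $\boldsymbol{j}-\boldsymbol{i}$):

```python
import numpy as np

def tilting_plane_basis(w, z_hat):
    """Orthonormal basis (z_hat, n_hat) of the tilting plane of w,
    obtained by one Gram-Schmidt step."""
    n = w - (w @ z_hat) * z_hat
    return z_hat, n / np.linalg.norm(n)

rng = np.random.default_rng(1)
w = rng.normal(size=784)                 # a model's weight vector (made-up)
z = rng.normal(size=784)                 # stands in for j - i (made-up)
z_hat = z / np.linalg.norm(z)

z_hat, n_hat = tilting_plane_basis(w, z_hat)

# Project 784-dimensional training points into the plane (2D coordinates).
# Since the basis is orthonormal, these coordinates preserve pixel distances.
X = rng.normal(size=(10, 784))
coords = np.stack([X @ z_hat, X @ n_hat], axis=1)
print(coords.shape)   # (10, 2)
```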

For high regularization levels, the model is parallel to the nearest centroid classifier and the adversarial distance is maximized. As $\lambda$ decreases, the classification boundary improves its fit of the training data by tilting towards directions of low variance. Eventually, a small number of misclassified training samples is overfitted, resulting in a very small adversarial distance and a weight vector that is hard to interpret.

Finally, we can look at two representative images $\boldsymbol{x}$, $\boldsymbol{y}$ (one per class) and their mirror images $\boldsymbol{x}_m$, $\boldsymbol{y}_m$ for each model. Their projections in the tilting plane of $\boldsymbol{w}$ give a very intuitive picture of the adversarial example phenomenon in linear classification:

The model is sensitive to strong adversarial examples ($\lVert\boldsymbol{x}_m-\boldsymbol{x}\rVert \to 0$ and $\lVert\boldsymbol{y}_m-\boldsymbol{y}\rVert \to 0$) when the tilting angle approaches $\pi/2$. This is a symptom of strong overfitting, and whether it occurs or not depends on how difficult the two classes are to separate (compare for instance the classification of 7s versus 9s with the classification of 0s versus 1s).

Adversarial Examples in Neural Networks

Thanks to the equivalence between adversarial distance and tilting angle, the linear case is simple enough to be visualized in the plane. In neural networks however, the class boundary is not flat and the adversarial distance cannot be reduced to a single parameter. Nonetheless, some similarities with the linear case remain.

First Step: a 2-Layer Binary Network

Let $\mathcal{N}$ be a 2-layer network with a single output defining a non-linear binary classifier in $\mathbb{R}^d$. The first layer of $\mathcal{N}$ is specified by a weight matrix $W_1$ and a bias vector $b_1$ and the second layer of $\mathcal{N}$ is specified by a weight vector $W_2$ and bias $b_2$. We assume that the two layers are separated by a layer $\phi$ of rectified linear units applying the function $z \to \max(0,z)$ element-wise. For an image $x$ in $\mathbb{R}^d$, we call the raw score of $x$ through $\mathcal{N}$ the value: $$s(x) := W_2\;\phi(W_1\,x + b_1) + b_2$$ Similarly to the linear case, the empirical risk on $T$ for a loss function $f$ can be written: $$R(W_1,b_1;W_2,b_2) := \frac{1}{n}\sum_{(x,y) \in T}f\big(y\,s(x)\big)$$ and training $\mathcal{N}$ consists of finding $W_1$, $b_1$, $W_2$ and $b_2$ minimizing $R$ for a well chosen $f$.

$\phi$ is piecewise linear and around each image $x$ there is a local linear region $\mathcal{L}_x$ within which: $$\phi(W_1 \, x + b_1) = W_1^x \, x + b_1^x$$ where $W_1^x$ and $b_1^x$ are obtained by zeroing some rows in $W_1$ and $b_1$ respectively. More precisely, the $i^{\text{th}}$ rows of $W_1^x$ and $b_1^x$ are: $$\left(W_1^x\,,\,b_1^x\right)_i = \begin{cases} \left(W_1\,,\,b_1\right)_i \quad & \text{if } \left(W_1\,x + b_1\right)_i \gt 0\\ 0 & \text{if } \left(W_1\,x + b_1\right)_i \leq 0 \end{cases}$$ Within $\mathcal{L}_x$, the raw score can thus be written: $$s(x) = W_2 W_1^x \, x + W_2 b_1^x + b_2$$ This can be seen as the raw score of a local linear classifier $\mathcal{C}_x$ and our analysis of the linear case then applies almost without modifications. First, we observe that $s(x)$ is a scaled distance. If $d(x)$ is the actual signed Euclidean distance between $x$ and $\mathcal{C}_x$, we have: $$s(x) = \lVert W_2 W_1^x \rVert\,d(x)$$

Notes: $d(x)$ can also be viewed as a linear approximation of the distance between $x$ and the boundary defined by $\mathcal{N}$ (the distance to the nearest adversarial example). $W_2 W_1^x$ is the gradient of $\mathcal{N}$ within $\mathcal{L}_x$; it is the adversarial direction for $x$, computed by backpropagation in practice.
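The local linear classifier can be checked numerically: masking the inactive rows of $(W_1, b_1)$ reproduces the network's raw score exactly within the linear region (a sketch with made-up weights):

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 10, 6                             # input and hidden sizes (made-up)
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=h), rng.normal()

def score(x):
    """Raw score of the 2-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.normal(size=d)

# Local linear classifier: zero the rows of (W1, b1) whose unit is inactive.
active = (W1 @ x + b1) > 0
W1x, b1x = W1 * active[:, None], b1 * active

# Score of the local linear classifier C_x: agrees with the network at x.
s_local = W2 @ W1x @ x + W2 @ b1x + b2
print(score(x), s_local)
```

The row `W2 @ W1x` is the local gradient mentioned in the notes above, and its norm is the local scaling parameter.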

The norm $\lVert W_2 W_1^x \rVert$ can then be interpreted as a scaling parameter for the loss function (the scaling is now local, dependent on $x$). One simple way to control all the local scalings simultaneously is by adding an L2 regularization term to the empirical risk acting on the norms $\lVert W_1 \rVert$ and $\lVert W_2 \rVert$ independently (remember that the weights in $W_1^x$ are a subset of the weights in $W_1$). With gradient descent, this is equivalent to decaying the weights $W_1$ and $W_2$ at every iteration. More precisely, for a learning rate $\eta$ and a decaying factor $\lambda$, the weight decay update is: $$W_1 \leftarrow W_1 - \eta\,\lambda\,W_1 \quad \text{and} \quad W_2 \leftarrow W_2 - \eta\,\lambda\,W_2$$

With a small decaying factor $\lambda$, the scaling parameter $\lVert W_2 W_1^x \rVert$ is allowed to grow unrestricted and the loss penalizes only the misclassified data. Minimizing the empirical risk is then equivalent to minimizing the error on the training set. As the decaying factor $\lambda$ increases, the scaling parameter $\lVert W_2 W_1^x \rVert$ decreases and the loss starts penalizing more and more of the correctly classified data, pushing it further away from the boundary. In this light, L2 weight decay can be seen as a form of adversarial training.

In summary, L2 regularization acts as a scaling mechanism on the loss function, both in linear classification and in small neural nets. With gradient descent, using a high weight decay results in a simple form of adversarial training.



Second Step: General Case

The previous analysis can be generalized to more layers and even to non-piecewise-linear activation functions. The important observation is that we always have: $$s(x) = \lVert \nabla_{\!x}\, s \rVert\, d(x)$$ where $\nabla_{\!x}\, s$ is the gradient of the raw score at $x$, and $d(x)$ is a linear approximation of the distance between $x$ and the boundary defined by the network. The norm $\lVert \nabla_{\!x}\, s \rVert$ then constitutes a scaling parameter for the loss function which can be controlled with weight decay.

This idea can also be extended beyond binary classification. In the multiclass case, the raw score becomes a vector whose elements are called the logits. Each logit $s_i(x)$ is then transformed into a probability $p_i(x)$ by applying the softmax function: $$p_i(x) := \text{softmax}_i(s(x)) := \frac{e^{s_i(x)}}{\displaystyle \sum_j \textstyle e^{s_j(x)}}$$ For an image/label pair $(x,y)$, the probability associated with the correct class is $p_y(x)$. The log-likelihood loss function encourages it to come close to $1$ by attributing the following penalty to $(x,y)$:

$$f(x,y) := -\log(p_y(x))$$

Now, varying weight decay influences the scaling of the logits, effectively acting as a temperature parameter for the softmax function. We can redefine the softmax function as: $$\text{softmax}_i(s(x),t) := \frac{e^{\frac{s_i(x)}{t}}}{\displaystyle \sum_j \textstyle e^{\frac{s_j(x)}{t}}}$$ where $t$ is the temperature parameter. It was introduced in the context of network distillation and controls the softness of the softmax function. When weight decay is very low, the probability distributions generated are close to one-hot encodings ($p_y(x) \approx 0 \text{ or } 1$) and only the misclassified data produces non-zero penalties. With higher weight decay however, the probability distributions generated become smoother and the correctly classified data starts to participate in the training, thus preventing overfitting.
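A temperature-parameterized softmax is a one-liner; the sketch below illustrates how higher temperatures soften the distribution (made-up logits):

```python
import numpy as np

def softmax(s, t=1.0):
    """Softmax with temperature t; higher t gives a softer distribution."""
    z = s / t
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 1.0, -2.0])
print(softmax(logits, t=1.0))    # close to one-hot
print(softmax(logits, t=100.0))  # close to uniform
```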

In practice, a number of observations suggest that modern deep networks are under-regularized:

They are often poorly calibrated and produce overconfident predictions. They often converge to zero training error, even on a random labelling of the data. They are often vulnerable to linear attacks of small magnitude.

Example: LeNet on MNIST

Is it possible to regularize a neural network against adversarial examples by only using weight decay? The idea is simple enough and has been considered before: Goodfellow et al. observed that adversarial training is "somewhat similar to L1 regularization" in the linear case. However, the authors reported that when training maxout networks on MNIST, an L1 weight decay coefficient of $0.0025$ "was too large, and caused the model to get stuck with over $5\%$ error on the training set. Smaller weight decay coefficients permitted successful training but conferred no regularization benefit." We put the idea to the test once again and our observations are more nuanced: while using a high weight decay is clearly not a panacea, we found that it does help reduce the adversarial example phenomenon, at least in simple setups.

Consider LeNet on MNIST (10-class problem). We use the baseline MatConvNet implementation with the following architecture:

We train one version of this network with a low weight decay of $10^{-4}$ and one version with a high weight decay of $10^{-1}$ (we refer to the two versions as LeNet low and LeNet high respectively). We keep all the other parameters fixed: we train for $50$ epochs, use a batch size of $300$, a learning rate of $0.0005$ and a momentum of $0.9$.

We can make several observations. To start, let's plot the training and test errors for the two networks as a function of the epoch.

We see that LeNet high is less overfitted (the train and test errors are approximately equal at the end of the training) and performs slightly better than LeNet low (final test error of $1.2\%$ versus $1.6\%$).

We can also inspect the weights that have been learnt. Below, we compute their root mean square value (RMS) and show a random selection of filters for each convolutional layer.

As expected, the weights learnt with a higher weight decay have a much lower RMS. The filters of LeNet high are also smoother than the filters of LeNet low (see the presence of clean edge detectors in $\text{Conv1}$ and $\text{Conv2}$) and their magnitudes vary more within each convolutional layer (see the presence of uniformly gray filters in $\text{Conv2}$ and $\text{FC1}$).

Finally, let's submit the two networks to the same visual evaluation: for a random instance of each digit, we generate a high confidence adversarial example targeted to perform a cyclic permutation of the labels $0 \to 1$, $1 \to 2$, ..., $9 \to 0$. Specifically, each adversarial example is generated by performing gradient ascent on the probability of the desired label until the value $0.95$ is reached. For a fair comparison, we set the temperature $t$ of the softmax layer of each network such that the median classification score over the test set is $0.95$.



For a target label $l$, the gradient $\nabla_{\!x}\, p_l(x)$ (computed by backpropagation) gives the direction of steepest ascent towards $l$. Here, we generated each adversarial example $x^\prime$ by iterating the following update rule: $$x^\prime = \text{clip}_{[0,255]} \left(x^\prime + 1.0 \times \frac{\nabla_{\!x^\prime}\, p_l(x^\prime)}{\lVert \nabla_{\!x^\prime}\, p_l(x^\prime) \rVert}\right)$$ until $p_l(x^\prime) \geq 0.95$. We show below the $10$ original images $\text{OI}$ with their corresponding adversarial examples $\text{AE}$ and adversarial perturbations $\text{Pert}$ for the two networks.
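For a purely linear multiclass model, the gradient $\nabla_{\!x}\, p_l(x)$ has a closed form and the update rule above fits in a few lines. The sketch below is an illustration with made-up weights (not the LeNet experiment; in practice the gradient comes from backpropagation):

```python
import numpy as np

rng = np.random.default_rng(3)
n_classes, d = 10, 784
W = 0.001 * rng.normal(size=(n_classes, d))   # made-up linear model
b = np.zeros(n_classes)

def probs(x):
    """Softmax probabilities of the linear model s(x) = W x + b."""
    s = W @ x + b
    e = np.exp(s - s.max())
    return e / e.sum()

def grad_p(x, l):
    """Closed-form gradient of p_l for linear logits:
    grad_x p_l = p_l * (W_l - sum_j p_j W_j)."""
    p = probs(x)
    return p[l] * (W[l] - p @ W)

def attack(x, l, step=1.0, target=0.95, max_iters=2000):
    """Targeted gradient ascent on p_l, clipped to the pixel range [0, 255]."""
    x = x.copy()
    for _ in range(max_iters):
        if probs(x)[l] >= target:
            break
        g = grad_p(x, l)
        x = np.clip(x + step * g / np.linalg.norm(g), 0.0, 255.0)
    return x

x = rng.uniform(0.0, 255.0, size=d)          # a made-up "image"
l = (np.argmax(probs(x)) + 1) % n_classes    # cyclic target label
x_adv = attack(x, l)
print(probs(x)[l], probs(x_adv)[l])
```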

We see that LeNet high is less susceptible to adversarial examples than LeNet low: the adversarial perturbations have higher L2 norms and are more meaningful to human observers.

Thoughts Moving Forward

Despite the widespread interest it has generated for several years now, and despite its significance for the field of machine learning both in theory and in practice, the adversarial example phenomenon has so far retained much of its intrigue. Our main goal here was to provide a clear and intuitive picture of the phenomenon in the linear case, hopefully constituting a solid base from which to move forward. Incidentally, we showed that L2 weight decay plays a more significant role in a small neural net on MNIST than previously suspected.

Unfortunately, the story gets more complicated with deeper models on more sophisticated datasets. In our experience, the more non-linear the model becomes, the less weight decay seems to help. This limitation may be superficial and it is perhaps worth exploring the ideas introduced here a bit further (for example, we should probably pay more attention to the scaling of the logits during training). Or the high non-linearity of deep networks might constitute a fundamental obstacle to the type of first-order adversarial training that L2 regularization implements. Our feeling is that a truly satisfying solution to the problem will likely require profoundly new ideas in deep learning.