$\begingroup$

Though I can only guess what was the intention of the authors, I interpret this passage as follows:

Unlike the sup-norm, the $L_2$ norm (or any $L_p$ norm for $p < \infty$) can be "forgiving" of large errors if they occur "infrequently" throughout the domain, whereas the sup-norm always takes the worst error. For a more concrete example, assume $d(x)=f(x)-g(x)$ is zero everywhere except on a measurable set $A$, on which $d(x) = n$ for some $n \in \mathbb{N}$. Let $\lambda$ be the Lebesgue measure. If $\lambda(A) = \frac{\epsilon^2}{n^2}$, i.e. the large error occurs "infrequently", then $||d||_2 = \epsilon$ while $||d||_\infty = n$. This "forgiveness" property is not generally a bad thing, as in many cases it is actually quite desirable to ignore "infrequent" errors, but here lies the problem in the specific context of ML: we implicitly assumed that the measure of $A$ is proportional to the frequency of the data on that part of the domain. That might be a reasonable assumption in other contexts (for bounded domains it amounts to assuming a uniform distribution), but in ML we want to be agnostic to the data distribution. More specifically, consider the previous example but now with a data distribution densely packed exactly on $A$, i.e. $\mathrm{Pr}(x \in A) = 1$, which means the true error is $\sqrt{\mathbb{E}[d^2]} = n$, quite far from the $\epsilon$ suggested by the $L_2$ norm. In other words, depending on the data distribution, the $L_2$ bound might be irrelevant, whereas the sup-norm bound is always relevant if you have no prior information on the distribution.
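To make the numbers concrete, here is a minimal sketch of the example above (the variable names and the specific values $n=10$, $\epsilon=0.1$ are my own choices, with the domain taken to be $[0,1]$):

```python
import math

n = 10.0                      # error magnitude on the "bad" set A
eps = 0.1                     # target L2 norm
measure_A = eps**2 / n**2     # Lebesgue measure of A, here 1e-4

# L2 norm of d under the Lebesgue measure on [0, 1]:
# d is n on A (measure eps^2/n^2) and 0 elsewhere.
l2_norm = math.sqrt(n**2 * measure_A)   # equals eps = 0.1

# sup-norm of d: the worst-case error, regardless of measure.
sup_norm = n                            # equals 10

# Root-mean-square error when the data distribution is
# concentrated on A, i.e. Pr(x in A) = 1:
rms_on_A = math.sqrt(n**2 * 1.0)        # equals n = 10
```

So the $L_2$ norm reports $0.1$ while, for data living entirely on $A$, the actual RMS error is $10$; the sup-norm bound of $10$ remains valid either way.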

So to sum it up: under the general setting of ML, we do not make any assumptions on the data distribution, because we typically do not know what our learning algorithms will encounter in practice. Hence, it is preferable for theoretical results to be data independent. In the context of this specific paper, which deals with the approximation properties of neural networks, using the sup-norm ensures that the degree of approximation (how large networks have to be to approximate a function) is not affected by the data distribution, because the sup-norm always takes the worst error. Under the $L_2$ norm, the bounds might not be relevant for data distributions that treat seemingly "infrequent" parts of the domain as highly frequent ones.

As a final note, while it is generally preferable to have data-independent results (because we ideally want to prove that our algorithms work in any case), making assumptions on the data distribution can lead to significantly better guarantees for the cases in which we believe those assumptions apply. At present, there are certain behaviors that we observe in practice that "classical" ML theory cannot fully explain (e.g. why neural networks generalize in practice), and many believe that forgoing the data-independent setting is a key step toward addressing these issues.