For Valid Generalization the Size of the Weights is More Important than the Size of the Network For Valid Generalization the Size of the Weights is More Important than the Size of the Network Bartlett, Peter L. 1996

Paper summary

davidstutz

Barlett shows that lower generalization bounds for multi-layer perceptrons with limited sizes of the weights can be found using the so-called fat-shattering dimension. Similar to the classical VC dimensions, the fat shattering dimensions quantifies the expressiveness of hypothesis classes in machine learning. Specifically, considering a sequence of points $x_1, \ldots, x_d$, a hypothesis class $H$ is said to shatter this sequence if, for any label assignment $b_1, \ldots, b_d \in \{-1,1\}$, a function $h \in H$ exists that correctly classifies the sequence, i.e. $\text{sign}(h(x_i)) = b_i$. The VC dimension is the largest $d$ for which this is possible. The VC dimension has been studied for a wide range of machine learning models (i.e., hypothesis classes). Thus, it is well known that multi-layer perceptrons with at least two layers have infinite VC dimension – which seems natural as two-layer perceptrons are universal approximators. As a result, most bounds on the generalization performance of multi-layer networks (and, thus, also of more general deep networks) do not apply as the VC dimension is infinite. The fat-shattering dimension, in contrast, does not strictly require the sequence $x_1,\ldots, x_d$ to be correctly classified into the labels $b_1,\ldots, b_d$. Instead, the sequence is said to be $\gamma$-shattered if real values $r_1,\ldots,r_d$ exist such that for every labeling, $b_1,\ldots,b_d$, some some $h \in H$ satisfies $(h(x_i) – r_i)b_i \geq \gamma$. Note that the values $r_i$ are fixed across labelings, i.e., are chosen “before” knowing the labels. The fat-shattering dimension is the largest $d$ for which this is possible. As a result, the fat-shattering dimension relaxes the VC dimension in that the models in $H$ are allowed some “slack” (in lack of a better word). Note that $H$ contains real-valued functions. Based on this definition, Barlett shows that multi-layer perceptrons in which all layers have weights $w$ constrained as $\|w\|_1 \leq A$ scales with $A^{l(l + 1)}$. More importantly, however, the fat-shattering dimension is finite. Thus, generalization bounds based on the fat-shattering dimensions apply and are discussed by Barlett; I refer to the paper for details on the bound. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).