The Regression Machine

While traditional tools worked well, there has recently been a Cambrian explosion of nontraditional layers. Consider the following conceptual layer. This layer isn't defined explicitly, but is instead defined implicitly as the solution of a sparse coding problem: $$\Psi(x)=\underset{y}{\mbox{argmin }}\big\{\tfrac{1}{2}\|Wy-x\|^{2}+\lambda\|y\|_{1}\big\}.$$ In the spirit of giving new names to old things, I dub this the Regression Machine (a better name would be the Sparse Coding Layer, but let's stick with this one).

This layer behaves very differently from a regular deep learning layer, and it is worth considering what it does. $W$ is a set of weights to be trained, and $\lambda$ is a parameter controlling how sparse the output is. Assuming a trained model with a sensible $W$, this layer is a sparse encoder: it takes its input $x$ and transforms it into its representation in terms of structural primitives (the columns of $W$).
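Because the layer is defined implicitly, its forward pass must be computed by an iterative solver. Here is a minimal sketch of what that could look like using ISTA (iterative soft-thresholding); the function names, step size, and iteration count are illustrative assumptions rather than anything prescribed by the layer itself.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_coding_layer(x, W, lam, n_iter=200):
    """Approximate Psi(x) = argmin_y 0.5 * ||W y - x||^2 + lam * ||y||_1 via ISTA."""
    # Step size 1/L, where L is the Lipschitz constant of the quadratic term's
    # gradient (the largest eigenvalue of W^T W).
    L = np.linalg.norm(W, 2) ** 2
    y = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ y - x)                    # gradient of 0.5 * ||W y - x||^2
        y = soft_threshold(y - grad / L, lam / L)   # gradient step followed by prox
    return y
```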

It is useful to think of a musical metaphor. The sound of a chord (our input $x$) is generated by a sparse superposition of sounds corresponding to a small number of notes (our sparse code $y$). Our layer throws this generative process into reverse: from the sound of the chord, it attempts to recover the individual notes.

If you have a background in engineering, these terms might be familiar to you in a different language. The generative process can be seen as the forward model, $x\approx Wy$ with $y$ sparse, and the problem of recovering the notes from its output as the inverse problem, which is exactly what $\Psi$ solves.
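To make the forward/inverse picture concrete, here is a small synthetic check that reuses the sparse_coding_layer sketch above; the dictionary size, sparsity level, and choice of $\lambda$ are arbitrary illustrative assumptions. We build a "chord" from three "notes" and ask the layer to recover them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 128
W = rng.standard_normal((d, k)) / np.sqrt(d)   # columns of W play the role of notes

# Forward model: a chord x generated by a sparse superposition of 3 notes.
y_true = np.zeros(k)
notes = rng.choice(k, size=3, replace=False)
y_true[notes] = rng.uniform(1.0, 2.0, size=3)
x = W @ y_true

# Inverse problem: run the layer and look at which notes carry the most weight.
y_hat = sparse_coding_layer(x, W, lam=0.05)
print("true notes:     ", np.sort(notes))
print("recovered notes:", np.sort(np.argsort(np.abs(y_hat))[-3:]))
```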

This layer is therefore, paradoxically, the opposite of a regular deep learning layer: rather than applying a forward map, it solves an inverse problem. But it is this lopsided layer that makes the most sense, as it truly takes data from input space into explanation space - a higher level of abstraction.

Gregor and LeCun observe that sparse coding exhibits the property of mutual inhibition: though $W$ may contain numerous similar columns, generally only one among them will be chosen to represent the input. It's no surprise, then, that this is the mechanism proposed by Olshausen et al. to model the early stages of visual processing.

Despite its implicit definition, we can still take its (sub)gradient, $$\begin{align*} \nabla\Psi(x)_{ij} & =\begin{cases} \big[(Z(W^{T}W)Z^{T})^{-1}ZW^{T}\big]_{ij} & y_{i}^{\star}\neq0\\ 0 & y_{i}^{\star}=0 \end{cases}\\ \frac{\partial}{\partial W_{ij}}\Psi(x) & =\frac{\partial}{\partial W_{ij}}(Z(W^{T}W)Z^{T})^{-1}Z\big[W^{T}x-\lambda\,\mbox{sign}(y^{\star})\big]\\ Z & =\mbox{the operator which removes the rows $j$ for which $\Psi(x)_{j}=0$} \end{align*}$$
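As a sanity check on the first formula, here is a sketch of how the input-Jacobian could be assembled once the solution $y^{\star}$ is in hand (for instance from the ISTA sketch above). $Z$ simply selects the active rows, so $Z(W^{T}W)Z^{T}$ is the Gram matrix of the active columns of $W$; the function and variable names are my own.

```python
import numpy as np

def psi_jacobian_x(W, y_star):
    """Closed-form (sub)gradient of Psi with respect to the input x.

    Rows i with y*_i == 0 are zero; on the active rows the Jacobian equals
    (Z W^T W Z^T)^{-1} Z W^T, where Z selects the active rows.
    """
    active = np.flatnonzero(y_star)        # the rows kept by Z
    W_a = W[:, active]                     # the columns of W used by the code
    jac = np.zeros((y_star.size, W.shape[0]))
    jac[active, :] = np.linalg.solve(W_a.T @ W_a, W_a.T)
    return jac
```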

Hence there is no technical roadblock in integrating this into a deep learning framework.

This leads to the troubling question - why isn't this kind of layer the de facto standard? The trouble with this layer is that it is expensive. Computing it requires solving a small optimization problem, whose dimension is on the order of the number of features in a datum. Measured in milliseconds this may be cheap - the cost is comparable to that of solving a linear system - but it is still an order of magnitude more expensive than a simple matrix multiplication passed through a nonlinearity. Adding a single layer like this would potentially increase the cost of every training step a hundredfold.
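For a rough sense of the gap, here is a quick comparison, again reusing the sparse_coding_layer sketch from above; the sizes and iteration count are arbitrary assumptions, and the exact ratio will depend on the hardware and the solver.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 1024
W = rng.standard_normal((d, k)) / np.sqrt(d)
x = rng.standard_normal(d)

t0 = time.perf_counter()
_ = np.maximum(W.T @ x, 0.0)               # ordinary layer: one matmul plus a ReLU
t1 = time.perf_counter()
_ = sparse_coding_layer(x, W, lam=0.1)     # implicit layer: hundreds of matmuls
t2 = time.perf_counter()                   # (the step size could be precomputed once)
print(f"dense layer:         {t1 - t0:.5f} s")
print(f"sparse coding layer: {t2 - t1:.5f} s")
```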