An engraving of the Turk from Karl Gottlieb von Windisch’s 1784 book Inanimate Reason (wiki)

In my previous post, while discussing the importance of DSLs in ML and AI, I mentioned the idea of Software 2.0, introduced by Andrej Karpathy:

Software 2.0 is written in neural network weights. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried). Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a program that satisfies the constraints. In the case of neural networks, we restrict the search to a continuous subset of the program space where the search process can be made (somewhat surprisingly) efficient with back-propagation and stochastic gradient descent [19]

Deep Learning is eating software [13]. In this post, we’ll dig deeper into this idea.

Automatic Differentiation

What is Automatic Differentiation (AD)? There are countless resources available online on the topic [0][6].

Ryan Adams, You Should Be Using Automatic Differentiation, 2016

In a nutshell:

Forward Mode computes directional derivatives, also known as tangents. The directional derivative can be evaluated without explicitly computing the Jacobian [4]. In other words, one sweep of forward mode calculates one column vector of the Jacobian, Jẋ, where ẋ is a column vector of seeds [7].

Reverse Mode computes directional gradients: one sweep of reverse mode calculates one row vector of the Jacobian, ŷJ, where ŷ is a row vector of seeds [7] (a short sketch of both sweeps follows after this list).

The computational cost of one sweep, forward or reverse, is roughly equivalent, but reverse mode requires access to intermediate variables, and therefore more memory.

Reverse mode AD is best suited for F: R^n -> R, while forward mode AD is best suited for G: R -> R^m. For other cases, with n > 1 and m > 1, the choice is non-trivial.

Backpropagation is merely a specialised version of automatic differentiation [2]: backpropagation is sometimes also known as reverse mode automatic differentiation. Check [3] for some historical background.
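To make the seed picture concrete, here is a tiny sketch using JAX (a library this post does not otherwise discuss, used here only because it exposes both modes directly); the function f is my own toy example:

```python
import jax
import jax.numpy as jnp

def f(x):
    # a toy function f: R^3 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# forward mode: one sweep with a column seed ẋ returns J ẋ
_, jv = jax.jvp(f, (x,), (jnp.array([1.0, 0.0, 0.0]),))   # first column of J

# reverse mode: one sweep with a row seed ŷ returns ŷ J
_, vjp_fn = jax.vjp(f, x)
(vj,) = vjp_fn(jnp.array([1.0, 0.0]))                      # first row of J
```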

Implementing Automatic Differentiation

The implementation of Automatic Differentiation is an interesting software engineering topic. [15] identifies two principal ways to implement Automatic Differentiation:

Operator Overloading — […] one can replace the types of the floating-point variables with a new type that contains additional derivative information, and overload the arithmetic operations for this new type so as to propagate this derivative information along (a minimal sketch of this approach follows below).

Program Transformation — One can instead decide to explicitly build a new source code that computes the derivatives. This is very similar to a compiler, except that it produces source code. This approach is more development-intensive than Operator Overloading, which is one reason why Operator Overloading AD tools appeared earlier and are more numerous. [15]
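To make the first (operator overloading) approach concrete, here is a minimal forward-mode sketch in Python (my own illustration, not code from [15]): a dual number carries a value together with its derivative, and every overloaded operation propagates both.

```python
class Dual:
    """A value paired with its derivative; arithmetic propagates both."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x + 1

x = Dual(2.0, 1.0)        # seed: dx/dx = 1
y = f(x)
print(y.value, y.deriv)   # 11.0 and f'(2) = 2 * 2 + 3 = 7.0
```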

autograd is likely one of the most frequently used automatic differentiation libraries.

autograd is a great place to start learning how to implement Automatic Differentiation:

To compute the gradient, Autograd first has to record every transformation that was applied to the input as it was turned into the output of your function. To do this, Autograd wraps functions (using the function primitive) so that when they're called, they add themselves to a list of operations performed. Autograd's core has a table mapping these wrapped primitives to their corresponding gradient functions (or, more precisely, their vector-Jacobian product functions).

To flag the variables we're taking the gradient with respect to, we wrap them using the Box class. You should never have to think about the Box class, but you might notice it when printing out debugging info.

After the function is evaluated, Autograd has a graph specifying all operations that were performed on the inputs with respect to which we want to differentiate. This is the computational graph of the function evaluation. To compute the derivative, we simply apply the rules of differentiation to each node in the graph. [autograd]
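In practice, using autograd looks like this (a toy example close to the one in autograd's own README):

```python
import autograd.numpy as np   # thinly wrapped NumPy: every operation is recorded
from autograd import grad

def tanh(x):
    y = np.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

grad_tanh = grad(tanh)        # traces the graph on each call, then walks it backwards
print(grad_tanh(1.0))         # ≈ 0.419974, i.e. 1 - tanh(1)**2
```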

This “boxing” is the object-oriented flavour of operator overloading. Another alternative, written in F#, is DiffSharp:

DiffSharp shows how to build AD with appropriate types. Other resources, for those familiar with Haskell, include [24] and [25]. I feel overloading shines in a functional, statically typed set-up:

A Haskell implementation of AD [24]

Open issues: Control Flow, In-Place Operations and Aliasing

It is crucial to note that Automatic Differentiation is applicable to code that contains control flow (branching, looping, ...). Support for control flow is a key selling point of Deep Learning frameworks with dynamic computational graphs (e.g. PyTorch, Chainer) — a capability also referred to as “Define-by-Run” [18].
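For instance, with PyTorch the differentiated function can contain ordinary Python loops and branches, because the graph is recorded anew on every call (a toy example of mine, not from [18]):

```python
import torch

def f(x):
    y = x
    while y.norm() < 10:          # data-dependent loop
        y = y * 2
    return y.sum() if y.sum() > 0 else -y.sum()   # data-dependent branch

x = torch.randn(3, requires_grad=True)
f(x).backward()
print(x.grad)   # the gradient reflects the branches actually taken on this input
```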

However, control flow can make the code only piecewise differentiable, which adds significant complexity [4].

If machine learning models become more like programs, then they will mostly no longer be differentiable — certainly, these programs will still leverage continuous geometric layers as subroutines, which will be differentiable, but the model as a whole would not be. As a result, using backpropagation to adjust weight values in a fixed, hard-coded network, cannot be the method of choice for training models in the future — at least, it cannot be the whole story. [21]

We need to figure out how to train non-differentiable systems efficiently. Current approaches include genetic algorithms, “evolution strategies”, certain reinforcement learning methods, and ADMM (alternating direction method of multipliers). Naturally, gradient descent is not going anywhere — gradient information will always be useful for optimizing differentiable parametric functions. But our models will certainly become increasingly more ambitious than mere differentiable parametric functions, and thus their automatic development (the “learning” in “machine learning”) will require more than backpropagation. [21]
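As a tiny illustration of the gradient-free family mentioned above, here is a toy evolution-strategies loop (my own sketch, not from [21]); the optimizer only ever evaluates the objective, so it does not care whether the objective can be backpropagated through:

```python
import numpy as np

def black_box_score(theta):
    # stand-in for a non-differentiable objective (e.g. a simulator or a
    # program with discrete choices): we only get a scalar score back
    return -np.sum((theta - 3.0) ** 2)

theta = np.zeros(5)
sigma, lr, population = 0.1, 0.02, 100
for _ in range(300):
    noise = np.random.randn(population, theta.size)
    scores = np.array([black_box_score(theta + sigma * n) for n in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    # estimate an ascent direction purely from perturbations and their scores
    theta += lr / (population * sigma) * (noise.T @ scores)

print(theta)  # drifts towards 3.0 in every coordinate
```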

Note: interestingly, as we will see, the problem of dealing with control flow goes hand in hand with an opportunity: the idea of decoupled deep learning modules (as opposed to end-to-end deep learning monoliths).

In-Place Operations

In-place operations, a necessary evil in algorithm design, pose an additional hazard:

In-place operations pose a hazard for automatic differentiation, because an in-place operation can invalidate data that would be needed in the differentiation phase. Additionally, they require nontrivial tape transformations to be performed. [16]

[16] provides an intuition for how PyTorch deals with in-place operations: invalidation.

Every underlying storage of a variable is associated with a version counter, which tracks how many in-place operations have been applied to the storage. When a variable is saved, we record the version counter at that time. When an attempt to use the saved variable is made, an error is raised if the saved value doesn’t match the current one. [16]
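A small sketch of this mechanism in action (my own example, not from [16]): sigmoid saves its output for the backward pass, so mutating that output in place trips the version check.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward uses its output, so y is saved
y.add_(1.0)            # the in-place op bumps y's version counter
y.sum().backward()     # raises: the saved tensor was modified by an in-place operation
```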

Aliasing

Let’s look at how PyTorch handles aliasing:
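The passage below refers to a short program along the following lines; this is my reconstruction rather than the snippet from [16], and recent PyTorch releases may handle it differently than described:

```python
import torch

x = torch.ones(4, requires_grad=True) * 2   # a non-leaf tensor with gradient history
y = x[:2]            # y is a view: it shares (part of) x's storage
x.add_(3)            # the in-place addition to x also updates the data y sees
y.sum().backward()   # per [16], the shared, modified storage is detected through
                     # the version counter and the program is rejected
```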

The in-place addition to x also causes some elements of y to be updated; thus, y’s computational history has changed as well. Supporting this case is fairly nontrivial, so PyTorch rejects this program, using an additional field in the version counter (see Invalidation paragraph) to determine that the data is shared [16]

Differentiable programming

One way of viewing deep learning systems is “differentiable functional programming” [8]. Deep Learning has a functional interpretation:

Weight-tying or multiple applications of the same neuron (e.g., ConvNets and RNNs) resemble function abstraction [8].

Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip) [8][12] (see the sketch below).
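A minimal sketch of that reading in plain NumPy (all names and sizes are illustrative): an RNN is a fold of one step function over a sequence, with the same weights reused at every step.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(8, 8))   # hidden-to-hidden weights (illustrative)
W_x = rng.normal(scale=0.1, size=(8, 4))   # input-to-hidden weights (illustrative)

def rnn_step(h, x):
    # the same "neuron" applied at every time step (weight tying / function abstraction)
    return np.tanh(W_h @ h + W_x @ x)

sequence = [rng.normal(size=4) for _ in range(5)]   # a toy input sequence
h0 = np.zeros(8)

# the structural pattern of an RNN is just a fold (reduce) over the sequence
h_final = reduce(rnn_step, sequence, h0)
```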

The most natural playground for exploring functional structures trained as deep learning networks would be a new language that can run back-propagation directly on functional programs. [14]

One of the benefits of a higher-level abstraction is that it becomes easier to design infrastructure that tunes both the model parameters and the hyper-parameters of the model [10], leveraging hyper-gradients:

The availability of hyper-gradients allows you to do gradient-based optimization of gradient-based optimization, meaning that you can do things like optimizing learning rate and momentum schedules, weight initialization parameters, or step sizes and mass matrices in Hamiltonian Monte Carlo models. [11]

Gaining access to gradients with respect to hyper-parameters opens up a garden of delights. Instead of straining to eliminate hyper-parameters from our models, we can embrace them, and richly hyper-parameterize our models. Just as having a high-dimensional elementary parameterization gives a flexible model, having a high-dimensional hyper-parameterization gives flexibility over model classes, regularization, and training methods. [10]
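As a toy illustration of a hyper-gradient (my own sketch, using JAX rather than the libraries discussed above): differentiate the loss obtained after a few steps of SGD with respect to the learning rate.

```python
import jax
import jax.numpy as jnp

def inner_loss(w):
    # toy training loss; its minimiser is w = 2
    return jnp.sum((w - 2.0) ** 2)

def loss_after_sgd(log_lr, w0, steps=5):
    # run a few SGD steps with learning rate exp(log_lr), then report the loss
    lr = jnp.exp(log_lr)
    w = w0
    for _ in range(steps):
        w = w - lr * jax.grad(inner_loss)(w)
    return inner_loss(w)

w0 = jnp.zeros(3)
hypergrad = jax.grad(loss_after_sgd)(jnp.log(0.1), w0)  # d(final loss) / d(log lr)
```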

There are, however, deeper implications:

It feels like a new kind of programming altogether, a kind of differentiable functional programming. One writes a very rough functional program, with these flexible, learnable pieces, and defines the correct behavior of the program with lots of data. Then you apply gradient descent, or some other optimization algorithm. The result is a program capable of doing remarkable things that we have no idea how to create directly, like generating captions describing images. [9]

I like where this line of thought goes: functional programming means functional composability.

Monolithic Deep Learning networks that are trained end-to-end, as we typically find today, are so immensely complex that we are incapable of interpreting their inference or behavior. Recent research has shown that an incremental training approach is viable: networks have been demonstrated to work well by training smaller units first and subsequently combining them to perform more complex behavior. [20]

Decoupled deep learning modules are an exciting research area: Decoupled Neural Interfaces using Synthetic Gradients, for example, have shown very promising results [22]

The road ahead of us

I am not sure if the term Differentiable Programming will stick around. The risk of confusion with Differential Dynamic Programming is high.

The idea, on the other hand, is intriguing. Very intriguing, and I am very happy to see projects such as Tensorlang [17] gaining traction.

Wired argued that soon “we won’t program computers. We’ll train them like dogs” [23]. Let’s see what happens.

Resources

[0] Griewank, Andreas, and Andrea Walther. “Evaluating derivatives: principles and techniques of algorithmic differentiation”. Vol. 105. SIAM, 2008.

[1] Ullrich, Karen, Edward Meeds, and Max Welling. “Soft weight-sharing for neural network compression.” arXiv preprint arXiv:1702.04008 (2017)

[2] Dominic Steinitz, Backpropogation is Just Steepest Descent with Automatic Differentiation, link

[3] Roger Grosse, Intro to Neural Networks and Machine Learning Lecture notes, link

[4] What is Automatic Differentiation?, link

[5] The gradient and the directional derivative, link

[6] Alexey Radul, Introduction to Automatic Differentiation, link

[7] Havard Berland, Automatic Differentiation, link

[8] Atılım Güneş Baydin, Differentiable Programming, link

[9] Christopher Olah, Neural Networks, Types, and Functional Programming, link

[10] Maclaurin, Dougal, David Duvenaud, and Ryan Adams. “Gradient-based hyperparameter optimization through reversible learning.” International Conference on Machine Learning. 2015.

[11] Hype: Compositional Machine Learning and Hyperparameter Optimization, link

[12] Christopher Olah, Neural Networks, Types, and Functional Programming, link

[13] Pete Warden, Deep Learning is Eating Software, link

[14] David Dalrymple, Differentiable Programming, link

[15] Hascoet, Laurent, and Valérie Pascual. “The Tapenade Automatic Differentiation tool: principles, model, and specification.” ACM Transactions on Mathematical Software (TOMS) 39.3 (2013): 20

[16] Paszke, Adam, et al. “Automatic differentiation in PyTorch.” (2017)

[17] Max Bendick, Designing a Differentiable Language for Deep Learning, link

[18] Carlos Perez, PyTorch, Dynamic Computational Graphs and Modular Deep Learning, link

[19] Andrej Karpathy, Software 2.0, link

[20] Carlos Perez, Deep Teaching: The Sexiest Job of the Future, link

[21] François Chollet, The future of deep learning, link

[22] Jaderberg, Max, et al. “Decoupled neural interfaces using synthetic gradients.” arXiv preprint arXiv:1608.05343 (2016).

[23] Jason Tanz, “The End of Code”, Wired Magazine, 2016, link

[24] Conal Elliott, What is automatic differentiation, and why does it work?, link

[25] Daniel Brice, Automatic Differentiation is Trivial in Haskell, link