Deep learning and free software


Deep-learning applications typically rely on a trained neural net to accomplish their goal (e.g. photo recognition, automatic translation, or playing go). That neural net uses what is essentially a large collection of weighting numbers that have been empirically determined as part of its training (which generally uses a huge set of training data). A free-software application could use those weights, but there are a number of barriers for users who might want to tweak them for various reasons. A discussion on the debian-devel mailing list recently looked at whether these deep-learning applications can ever truly be considered "free" (as in freedom) because of these pre-computed weights—and the difficulties inherent in changing them.
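To make "weighting numbers that have been empirically determined" concrete, here is a minimal sketch in Python of a single artificial neuron learning the logical OR function. It is purely illustrative and is not taken from any project discussed in the thread; the names train_perceptron, predict, and OR_DATA are invented for the example. The point is that the trained result is nothing more than a short list of numbers produced by repeatedly nudging them against training data; real networks differ mainly in having millions of such numbers.

```python
# Illustrative sketch only: one artificial neuron, not a deep network.
# All names here are hypothetical and chosen for this example.

def predict(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs exceeds zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=0.1):
    """Determine weights empirically from (inputs, target) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Nudge each weight in the direction that reduces the error.
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Training data: the truth table for logical OR.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_perceptron(OR_DATA)
print("learned weights:", weights, "bias:", bias)
```

Nothing in the resulting numbers explains why they have the values they do; that information lives in the training data and the training procedure.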

The conversation was started by Zhou Mo ("Lumin"); he is concerned that, even if deep-learning application projects release the weights under a free license, there are questions about how much freedom that really provides. In particular, he noted that training these networks is done using NVIDIA's proprietary cuDNN library that only runs on NVIDIA hardware.

Even if upstream releases their pretrained model under GPL license, the freedom to modify, research, reproduce the neural networks, especially "very deep" neural networks is de facto [controlled] by PROPRIETARIES.

While it might be possible to train (or retrain) these networks using only free software, it is prohibitively expensive in terms of CPU time to do so, he said. So, he asked: "Is GPL-[licensed] pretrained neural network REALLY FREE? Is it really DFSG-compatible?" Jonas Smedegaard did not think the "100x slower" argument held much water in terms of free-software licensing. Once Mo had clarified some of his thinking, Smedegaard said:

I believe none of the general public licenses (neither liberal nor copyleft) require non-[ridiculous] cost for the freedoms protected. I therefore believe there is no license violation, as long as the code is _possible_ to compile without non-free code (e.g. blobs to activate GPUs) - even if ridiculously expensive in either time or hardware.

He did note that if rebuilding the neural network data were required for releases, there would be a practical problem: blocking the build for, say, 100 years would not really be possible. That stretches way beyond even Debian's relatively slow release pace. Theodore Y. Ts'o likened the situation to that of e2fsprogs, which distributes the output from autoconf as well as the input for it; many distributions will simply use the shipped output because newer versions of autoconf may not generate it correctly.

Ian Jackson stated strongly that, in his opinion, GPL-licensed neural networks are neither truly free nor DFSG-compatible:

Things in Debian main [should] be buildable *from source* using Debian main. In the case of a pretrained neural network, the source code is the training data. In fact, they are probably not redistributable unless all the training data is supplied, since the GPL's definition of "source code" is the "preferred form for modification". For a pretrained neural network that is the training data.

But there may be other data sets that have similar properties, Russ Allbery said in something of a thought experiment. He hypothesized about a database of astronomical objects where the end product is derived from a huge data set of observations using lots of computation, but the analysis code and perhaps some of the observations are not released. He pointed to genome data as another possible area where this might come up. He wondered whether that kind of data would be compatible with the DFSG. "For a lot of scientific data, reproducing a result data set is not trivial and the concept of 'source' is pretty murky."

Jackson sees things differently, however. The hypothetical astronomical database can be changed as needed or wanted, but the weightings of a neural network are not even remotely transparent:

Compare neural networks: a user who uses a pre-trained neural network is subordinated to the people who prepared its training data and set up the training runs. If the user does not like the results given by the neural network, it is not sensibly possible to diagnose and remedy the problem by modifying the weighting tables directly. The user is rendered helpless. If training data and training software is not provided, they cannot retrain the network even if they choose to buy or rent the hardware.

That argument convinced Allbery, but Russell Stuart dug a little deeper. He noted that the package that Mo mentioned in his initial message, leela-zero, is a reimplementation of the AlphaGo Zero program that has learned to play go at a level beyond that of the best humans. Stuart said that Debian already accepts chess, backgammon, and go programs that he probably could not sensibly modify even if he completely understood the code.

[...] Debian rejecting the example networks as they "aren't DFSG" free would be a mistake. I view one of our roles as advancing free software, all free software. Rejecting some software because we humans don't understand it doesn't match that goal.

Allbery noted that GNU Backgammon (which he packages for Debian) was built in a similar way to AlphaGo Zero: training a neural network by playing against itself. He thinks the file of weighting information is a reasonable thing to distribute:

I think it's the preferred form of modification in this case because upstream does not have, so far as I know, any special data set or additional information or resources beyond what's included in the source package. They would make any changes exactly the same way any user of the package would: instantiating the net and further training it, or starting over and training a new network.
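Allbery's point, that changes are made by "instantiating the net and further training it" rather than by editing the numbers by hand, can be sketched with a toy example in Python. This is an assumption-laden illustration, not code from GNU Backgammon or leela-zero: the "pretrained" weights below are made-up values for a single neuron that computes logical OR, and the names predict, continue_training, and AND_DATA are hypothetical. Further training on new data is what shifts the behavior to logical AND.

```python
# Illustrative sketch only; the pretrained values and all names are
# hypothetical and do not come from any real package.

def predict(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs exceeds zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def continue_training(weights, bias, samples, epochs=50, lr=0.1):
    """Start from existing weights and keep adjusting them on new data."""
    weights = list(weights)  # leave the caller's "weight file" untouched
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Stand-in for a distributed weight file: a neuron that behaves like OR.
pretrained_weights, pretrained_bias = [0.1, 0.1], 0.0

# New training data describing the behavior the user wants instead (AND).
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

new_weights, new_bias = continue_training(
    pretrained_weights, pretrained_bias, AND_DATA
)
print("before:", [predict(pretrained_weights, pretrained_bias, x) for x, _ in AND_DATA])
print("after: ", [predict(new_weights, new_bias, x) for x, _ in AND_DATA])
```

For a toy like this, retraining is instantaneous; the concern raised in the thread is that the same step for a large network may demand far more computing resources than an individual user has.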

However, Ximin Luo (who filed the "intent to package" (ITP) bug report for adding leela-zero to Debian) pointed out that there is no weight file that comes with leela-zero. There are efforts to generate such a file in a distributed manner among interested users.

So the source code for everything is in fact FOSS, it's just the fact that the compilation/"training" process can't be run by individuals or small non-profit orgs easily. For the purposes of DFSG packaging everything's fine, we don't distribute any weights as part of Debian, and upstream does not distribute that as part of the FOSS software either. This is not ideal but is the best we can do for now.

He is clearly a bit irritated by the DFSG-suitability question, at least with regard to leela-zero, but it is an important question to (eventually) settle. Deep learning will clearly become more prevalent over time, for good or ill (and Jackson made several points about the ethical problems that can stem from it). How these applications and data sets will be handled by Debian (and other distributions) will have to be worked out, sooner or later.

A separate kind of license for these data sets (training or pre-trained weights), as the Linux Foundation has been working on with the Community Data License Agreement, may help a bit, but won't be any kind of panacea. The license doesn't really change the fundamental computing resources needed to use a covered data set, for example. It is going to come down to a question of what a truly free deep-learning application looks like and what, if anything, users can do to modify it. The application of huge computing resources to problems that have long bedeviled computer scientists is certainly a boon in some areas, but it would seem to be leading away from the democratization of software to a certain extent.