We know that one tactic for convincing people to stop doing something you don't want them to do is to tell them nobody else is doing it anymore. Peer pressure is a powerful force, and in the world of technology, there's a particularly strong desire to be seen as current. That could be why we've been seeing reports in the tech press that the use of copyleft licenses, like the GNU General Public License (GPL), is declining in comparison to the use of lax permissive licenses like the Apache or Expat (commonly but unfortunately called MIT) licenses.

All of the articles I've seen making this claim cite the same few corporate "studies" as their primary sources. The evidence they present is not evidence at all, because neither the specific data set nor the methodology used is published. No field of science accepts experimental conclusions that can't be reproduced by others. We shouldn't accept such conclusions when it comes to counting license use either.

Counting the licenses used by free software projects may seem straightforward. By definition, all of their code is published in publicly available repositories, and should carry easy-to-read notices indicating the applicable licenses. But doing it turns out to involve a minefield of potential errors and biases.

I reviewed some of the data likely used by companies counting licenses, and found obvious mistakes. At the time of this writing, openhub.net, operated by the company Black Duck and so almost surely used in its license counting data set, lists GNU Bash as GPLv2-or-later. Bash has been GPLv3-or-later for several years. While it's now been corrected, the site also listed GNU Emacs as GPLv2-only, a license the project has never had. I found these errors on the first two projects I spot-checked. How many more would we find if the full data set were identified?

Even if the inputs were perfect, writing software to count licenses is extremely difficult and requires making many normative choices. These choices need to be disclosed if we're to draw any accurate conclusions. The problems start with deciding what qualifies as a project to count. Do you care whether the code actually works, or whether it's had contributions from more than one person? How do you handle duplication? Projects often change code hosting sites without removing their old home. If you are crawling multiple hosts, is your code smart enough to tell when two programs are the same? Does a forked or slightly modified version count as a separate program? Versions of the same program for different operating systems can conceivably each be under a different license. Do you count them separately?
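To illustrate how much the duplication question alone can move the numbers, here is a minimal sketch in Python. The URLs, host names, and the project_key heuristic are all hypothetical, invented for this example; no actual license-counting tool is known to work this way.

```python
def project_key(url):
    """Reduce a repository URL to its last path component, lowercased, without '.git'."""
    name = url.rstrip("/").rsplit("/", 1)[-1].lower()
    if name.endswith(".git"):
        name = name[:-4]
    return name

# Hypothetical crawl results from two code hosting sites.
urls = [
    "https://host-a.example/alice/Frobnicator.git",
    "https://host-b.example/mirrors/frobnicator",  # the project's old home
    "https://host-a.example/bob/frobnicator",      # a fork, or an unrelated program?
    "https://host-a.example/alice/widget",
]

# Counting every URL treats mirrors and forks as separate programs.
naive_count = len(urls)

# Counting by name merges mirrors, but also merges forks and name collisions.
deduplicated_count = len({project_key(u) for u in urls})

print(naive_count, deduplicated_count)
```

Neither number is obviously the right one. The heuristic halves the count here, and a different heuristic would give a different answer, which is why the choice needs to be disclosed.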

After you've determined which projects qualify, you have to parse their license information. License notices are not yet predominantly in structured, machine-readable formats. They are written by and for humans, with typos and inconsistent formatting that confound automated parsers. When licenses are recognized, there may be several of them. A GPL-covered project can contain files carrying lax permissive license notices, because it is allowable -- and common -- to redistribute such files as part of a copyleft work. Does that add one just to the GPL column, or do you also increment the noncopyleft license columns?
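The two ways of filling in the columns can be sketched as follows. This is an illustration only: the project names and license sets are made up, and both counting functions are assumptions about how a counter might be written, not a description of any published methodology.

```python
from collections import Counter

# Hypothetical input: each project maps to the license notices found in its files.
projects = {
    "editor": {"GPL-3.0-or-later", "Expat"},  # copyleft work that includes Expat files
    "shell": {"GPL-3.0-or-later"},
    "tiny-lib": {"Expat"},
}

COPYLEFT = {"GPL-3.0-or-later"}

def count_every_notice(projects):
    """Increment a column for every distinct license notice in each project."""
    counts = Counter()
    for licenses in projects.values():
        counts.update(licenses)
    return counts

def count_effective_license(projects):
    """Count one license per project, letting a copyleft license take precedence."""
    counts = Counter()
    for licenses in projects.values():
        copyleft = sorted(licenses & COPYLEFT)
        effective = copyleft[0] if copyleft else sorted(licenses)[0]
        counts[effective] += 1
    return counts

by_notice = count_every_notice(projects)
by_effective = count_effective_license(projects)

print(dict(by_notice))
print(dict(by_effective))
```

With the same three projects, the first policy reports Expat and the GPL as tied, while the second reports the GPL ahead two to one. The data did not change; only the undisclosed choice did.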

Once you've decided a project qualifies, and have figured out how to represent its license(s), you then have to decide how much weight to give it. Do you care about the size of the codebase? If you don't, then you will count a large package like GNU Emacs as equal to a small Node.js library. If you do care, then you have to create categories to better compare apples to apples, and those criteria need to be shared for others to properly understand the results. Do you care about the size of the user base? If you don't, you will count a GitHub repo containing someone's personal configuration files, kindly shared under a free license but really intended only for their personal use, the same as GCC, used as the foundation for billions of dollars in economic value. If you do care, then you need to share how you determined the user base and how that was incorporated.
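A small sketch shows how far the weighting choice alone can swing a result. All names and figures below are invented for illustration, and line count is just one of many possible weights; this is not a reconstruction of any real study.

```python
from collections import defaultdict

# Hypothetical projects with made-up sizes.
projects = [
    {"name": "large-editor", "license": "copyleft", "lines": 1_500_000},
    {"name": "small-lib-a", "license": "noncopyleft", "lines": 200},
    {"name": "small-lib-b", "license": "noncopyleft", "lines": 300},
]

def license_share(projects, weight):
    """Return each license category's share of the total, under a given weighting."""
    totals = defaultdict(float)
    for project in projects:
        totals[project["license"]] += weight(project)
    grand_total = sum(totals.values())
    return {lic: total / grand_total for lic, total in totals.items()}

# Every project counts once, regardless of size.
per_project = license_share(projects, lambda p: 1)

# Every line of code counts once.
per_line = license_share(projects, lambda p: p["lines"])

print(per_project)
print(per_line)
```

Counted per project, copyleft holds a third of this tiny sample. Counted per line, it holds nearly all of it. A headline built on either figure means little unless the weighting is stated.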

Counting licenses used across the entire universe of free software is not an easy job. Whether any given article claiming that copyleft is declining is part of an intentional anti-copyleft effort or not, it risks creating a self-fulfilling prophecy by increasing peer pressure against choosing copyleft licenses. As an individual advocate for user freedom, you can make a difference by questioning these claims when you see them.

Ask two questions: First, is the methodology, including the code used to do the counting, published? Second, is the data set published? If the answer to either question is no, then the claim should be ignored entirely. It's no better than an assertion, and the interpretation of the "data" will be like reading tea leaves -- just the author's own confirmation bias from within their particular bubble.

You can avoid the self-fulfilling prophecy by choosing copyleft for your own projects. Individual license choices have a large impact, because they influence the decisions made by future projects based on yours, or that integrate with yours. From my bubble, I see plenty of people continuing to choose copyleft. We interview some of them every month in a blog series. Recently, the Department of Defense chose the GNU Affero GPL as the license for a new project, and plans to use the GPL as the default for its future projects.

You can also help efforts to scientifically collect information about software license usage. Our Free Software Directory is growing into a useful resource for this, and welcomes volunteer contributions. The Software Heritage Project will be extremely useful in this area as well, and there are packages like FOSSology which aim to do the work of license counting with free, auditable software.

In the end, we need to remember that numbers about who chooses which free license may not be that useful or interesting. Free licenses share the same pie with proprietary licenses, so increases in noncopyleft use may be coming at the expense of proprietary licenses, not copyleft. And noncopyleft licenses are still free software licenses. If every proprietary license were replaced with a noncopyleft free license tomorrow, that would be an amazing victory for our movement.

Licenses are a means to the end of user freedom. Copyleft remains the best tool we have for achieving and securing that freedom in the context of our current global regimes on copyright, patents, and contracts. We need it now more than ever. Software under noncopyleft licenses is free, but contingent -- future improvements to it can be made proprietary, essentially pulling the rug out from under us. Only copyleft builds a solid, free foundation. But if we want to measure something, let's focus on metrics of how free we are in our daily, increasingly digital, lives.