In 2013, the GNOME Foundation ran a successful campaign that raised funds for enhancing privacy features in the GNOME desktop and application suite. Unfortunately, subsequent changes in the organization left GNOME without a clear plan for how best to use the earmarked funds, so they remain—untouched—in GNOME's bank account. At GUADEC 2016 in Karlsruhe, Germany, the topic of how to utilize the money was revisited, and a plan has now begun to take shape.

The funding-cryptographers problem

To recap, the GNOME fundraising campaign started in December 2012 with the stated aim "to enhance GNOME 3 so that it offers one of the most secure computing environments available." Several privacy-related feature possibilities were listed on the announcement page, including enhanced disk-encryption support, Tor integration, anti-phishing features, and tools to integrate applications with system-wide privacy settings. The target amount was US $20,000, which was reached in late July 2013, a few weeks before that year's GUADEC.

The campaign was much like the previous year's fundraiser to improve accessibility features. As was the case with that earlier effort, the project took some time to consider what its options were. One can see the topic recur in many of the discussions posted to GNOME's Foundation list, for example. For whatever reasons, though, a concrete plan for the privacy funds never materialized.

The explanation may be partly organizational in nature. GNOME has been without an executive director since March 2014, when Karen Sandler went to the Software Freedom Conservancy. That meant that the day-to-day duties of running the GNOME Foundation have had to be spread around among various (volunteer) members of the board.

On top of that, much of the board's time in 2014 was rather unexpectedly taken up by the trademark-infringement dispute with Groupon. Handling that extra workload in addition to the board's duties and the tasks normally tackled by an executive director was, no doubt, time-consuming.

But the subject of privacy is also a bit more nebulous than accessibility, further complicating the task of devising a game plan. At GUADEC 2014, a group of project members formed the Safety and Privacy Team and set out to create a list of potential target tasks. The ideas are quite varied in nature and touch a lot of different GNOME components and applications. Finally, it is undeniable that privacy and security features tend to score high on the perceived difficulty scale, and the reality is that $20,000 does not buy much developer time at professional rates (though what such rates are is a matter of debate, of course). So how to spend the money proved just as tricky as what to spend it on.

Breaking the logjam

The upshot, though, is that the raised funds have remained untouched in the GNOME Foundation's accounts, but they have not been forgotten. How to get the ball rolling again was a question posed to the board at GUADEC 2015 (in addition to various side discussions). And, at GUADEC 2016, security and privacy questions came up again in several sessions, including Werner Koch's keynote and Federico Mena Quintero's session on usability problems in the GNOME GnuPG front-end, Seahorse. Invariably, whenever those discussions took place, the unsolved issue of allocating the privacy-campaign funds came up.

Eventually, the board decided that it had to find someone to spearhead the process of moving the privacy work forward. Board member Cosimo Cecchi agreed to take on that task, and he hosted an unconference session on the topic (with a nearly packed room) on the last day of GUADEC 2016. The session began with a status update. Cecchi said there had been talk of paying for a professional security audit of networked GNOME applications (like GNOME Maps, Notes, and Weather) but that, upon further review, it seemed unlikely that this would be the most beneficial use of the funds, since all of those applications are TLS-secured and do not handle personally identifiable information. In other words, those applications already consider privacy issues, and the costs of such an audit (which would no doubt be expensive) were deemed to outweigh the potential benefits. The same cannot be said of most of the other feature ideas proposed thus far.

He also noted that there has, in fact, been progress on the privacy front in recent years—the first item on the Safety and Privacy Team's wishlist was application sandboxing, a feature on which GNOME has made tremendous strides forward. He reviewed several of the common suggestions for how to spend the money as well, including sponsoring a series of hackfests, hiring developers, and establishing feature bounties.

For the rest of the session, Cecchi fielded suggestions from the audience on what the project should pursue and how to pursue it. Somewhat regrettably, most of the comments focused yet again on adding to the list of potential new privacy features, with fewer suggestions being made in the way of implementation ideas.

A page on the GNOME wiki now tracks the status of the project as well as the list of potential features and implementation ideas. After GUADEC, the process for moving forward on the privacy fund was discussed in the August 23 and August 30 board meetings. The minutes from the August 23 meeting are online, while the August 30 minutes have not yet been posted. Board member Allan Day did provide a brief update via email, however.

As of now, the feature bounty idea has been dropped, as has the hackfest idea. Instead, the plan is to use the funds to employ several paid interns to work on privacy-related features, with mentoring from GNOME developers. That is expected to provide the most bang for the buck, so to speak. And those internships would be directly sponsored by the GNOME Foundation, not through Google Summer of Code (GSoC), Outreachy, or other such programs—although there is certainly nothing to prevent privacy-related work taking place through other means. Day added that the GNOME community will be asked to help choose the final set of intern projects, although a time frame for the selection and the internships is still under discussion.

So it remains to be seen how many interns will be involved and what the specific projects will be, but the ball does seem to be moving forward again. In recent years, many free-software projects have learned how valuable GSoC and Outreachy interns can be; internships funded directly by GNOME for a specific purpose will, in all probability, prove equally beneficial. And there is always the possibility that the internship program will have ripple effects elsewhere in the project simply by raising the profile of privacy work—there is no shortage of privacy feature ideas, and the GNOME community seems excited to pursue them one way or another.

Whatever the new privacy features end up being, GNOME supporters will surely be pleased to see the fundraiser bear fruit after a series of interruptions.

[The author would like to thank the GNOME Foundation for travel assistance to attend GUADEC 2016.]


At LinuxCon North America 2016, Daniel German presented some research that he and others have done to extract more fine-grained authorship information from Git. Instead of the traditional line-based attribution for changes, they took things to the next level by looking at individual tokens in the source code and attributing changes to individuals. This kind of analysis may be helpful in establishing code provenance for copyright and other purposes.

German, who is from the University of Victoria, worked on the project with Kate Stewart of the Linux Foundation and Bram Adams of Polytechnique Montréal. It was a "combination of research plus hacking", he said, and the results were fascinating.

Git and blame

Git is in widespread use and not just for software development. Its pervasiveness is "proof enough that it is an excellent tool". It is also a great archival tool for historical research. Each revision gets stored away and can be compared against other versions using diff and similar utilities. That allows users to see some interesting differences between the versions.

The git blame command is often used to determine who changed a particular line or section of a code file. The GitHub web interface to the blame command gives a side-by-side view of authors and the lines they changed, with links to the revisions where the change was made. It is line-based, which is "sufficiently good for most of the tasks we have", German said.

It is now common for authors of text documents to use Git for version control. He uses it when working on papers with his students and other collaborators. All of the advantages that it provides for source code, such as branching, merging, and blaming, can be applied equally well to text, he said.

But Git allows its users to rewrite history; it provides a lot of tools to clean up the commits in a repository before pushing them to a public repository. He is a software archaeologist, though, so he would like to see more of the raw history. Quoting Indiana Jones, he said that archaeology is a search for facts, not truth. The more facts there are available, the more history can be reconstructed.

So the history in Git is likely to be incomplete, but there is still a lot there; what can be done with that? The line-oriented information is fine for many uses, but perhaps looking more deeply inside the lines of code would provide other insights. To research that idea, the project first looked at Git itself. The Git project is likely to be one of the better users of the Git tool, he said. Plus, there was a certain symmetry to using Git to recursively study its own repository.

Tokenize

The idea was to tokenize the source code so that changes could be tracked at that level. The first step was deciding what counts as a token. The equality operator (==) is an obvious token, as are function and variable names and the like. Strings and comments might each be considered a series of tokens, but the project ultimately decided to treat each string or comment as a single token.
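The team's tools have not been released, but the basic idea can be sketched in a few lines of Python. This is only an illustration under the rules described above (each comment and string literal becomes one token), not the researchers' actual tokenizer:

```python
import re

# A minimal sketch of a C tokenizer. Comments and string literals are
# each matched as a single token; alternation order matters, so they
# come before the general patterns.
TOKEN_RE = re.compile(r"""
    /\*.*?\*/                          # block comment -> one token
  | //[^\n]*                           # line comment  -> one token
  | "(?:\\.|[^"\\])*"                  # string literal -> one token
  | '(?:\\.|[^'\\])*'                  # character literal
  | [A-Za-z_]\w*                       # identifier or keyword
  | \d+\.?\d*                          # number
  | ==|!=|<=|>=|\+\+|--|&&|\|\||->     # multi-character operators
  | \S                                 # any other single character
""", re.VERBOSE | re.DOTALL)

def tokenize(source):
    """Return the list of tokens in a C source string."""
    return TOKEN_RE.findall(source)

if __name__ == "__main__":
    code = 'if (a == b) return strcmp(s, "a == b"); /* compare */'
    for tok in tokenize(code):
        print(tok)
```

Emitting each token on its own line, as the loop above does, gives the one-token-per-line "view" format that the researchers' filter produces.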

He showed an example of a line in the Git source repository that had been originally authored by Linus Torvalds in 2006. It was changed twice in 2014, but those changes simply altered the argument list, whereas a look at the tokens would show the actual tokens that changed and in which commits (as shown on the left). German said that he wanted to be able to get that level of detail, "whether it is good or not is a different question".
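Python's standard difflib can illustrate what such a token-level comparison of two versions of a line looks like. The line below is a made-up example for illustration, not the actual line from the slide:

```python
from difflib import SequenceMatcher

def token_diff(old_tokens, new_tokens):
    """Report which tokens changed between two versions of a line."""
    changes = []
    sm = SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            changes.append((op, old_tokens[i1:i2], new_tokens[j1:j2]))
    return changes

# Two versions of a (hypothetical) line: only the argument list changed,
# so all of the other tokens keep their original attribution.
old = ["static", "int", "run_hook", "(", "const", "char", "*", "hook", ")"]
new = ["static", "int", "run_hook", "(", "const", "char", "*", "hook",
       ",", "int", "quiet", ")"]
print(token_diff(old, new))  # → [('insert', [], [',', 'int', 'quiet'])]
```

A line-based diff would attribute the entire rewritten line to the later commit; the token-level comparison isolates exactly the three inserted tokens.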

A tokenizer was developed for C code that essentially acts as a filter on an input file, producing the tokenized output, which German called a "view" of the file. Each language needs its own tokenizer; beyond C, there are tokenizers for Java and Python. C++ is harder, he said, since the researchers didn't want to build a full-blown parser for each language.

After early experiments with the Git source, the researchers turned to the Linux kernel. The filter was run on every version of every file that appears in the Linux kernel repository. It is not a difficult thing to do, but it takes a few days to do it. The filter produces a file with one token per line.

Then the token files were matched up with the commits in the kernel repository to create views of those commits that were, naturally, checked into a Git repository that mirrored that of the mainline kernel. The goal was to create a repository with history by token, then all of the Git (and GitHub) tools can be used to look at that history.

The kernel project had no version control system until the adoption of BitKeeper in 2001. Some subsystems used CVS, but Linus Torvalds never liked the existing choices, so the net became the repository and changes were sent to him as patches. But there is a repository that Yoann Padioleau put together that covers 0.01 up until 2001. At that point, the BitKeeper era started, which ran until Git was created in 2005. Thomas Gleixner has a Git repository for the BitKeeper period and Torvalds, of course, has a repository from 2005 on.

The earliest repository has low granularity for changes (as it is based on the release tarballs), while the BitKeeper granularity is generally good; the best commit history comes from the Git-era repository, unsurprisingly. Unlike some other repositories (PostgreSQL, for example), the Linux Git history is not simply made up of squashed feature commits without merges. That allows the history to be followed.

Git allows concatenating these three separate repositories into one, effectively. Git also has some other features that were useful for historical analysis. In particular, the rename detection was valuable for the project, German said.

There is one warning he provided about the data: the "author" in Git terms may not actually be the author of the code. It may be that the Git author is simply passing the code along from elsewhere. In addition, refactoring or moving code around within the tree may credit the wrong person with authorship. Refactoring is an area that needs more research, he said, since even lawyers are not able to decide who holds the copyright for refactored code.

Findings

He then presented some of the findings of the research, which looked at changes up through the 4.7 release. The number of tokens was roughly six times larger than the number of non-blank, non-comment lines of code. The number of people that appear in git blame for 4.7 is roughly the same either way, though (12,005 by lines, 12,087 by tokens).

He was curious to see what function had remained closest to its 0.01 version. It turns out that skip_atoi() contains the most code from 0.01. That kind of makes sense, he said, since the mechanism to convert a string to an integer hasn't needed much change. He put up a slide (seen on the right) that demonstrated the user interface to look at token-level changes. It lists three commits and the number of tokens changed; each commit has its changes in a different color and you can hover over a change to see what the commit message was.

A file that has not changed a whole lot is ctype.c , which consists of a table that maps character values to their types (white space, digits, letters, etc.). If you look at the line-level git blame output, it would seem that Andre Rosa updated most of the file in 2009, but that is not really the case. It turns out that a const was added and tabs were changed to spaces, which made Rosa the "owner" of much of the file. In fact, the bulk of the file tokens date back to the days before version control for the kernel (called "pre-git" in the slides [SlideShare]). In 1996, though, some additional mappings were added, but the commas are still there from the original file, which calls into question who has the copyright for that piece, he said.

At the beginning of the project, he wondered if token-level attribution would make a real difference in the authorship numbers. As it turns out, he was surprised to find that comparing authorship by lines versus tokens does not change things much. "Some people win, some lose", but it all comes out as a wash. He showed a graph of the year of origin of code in Linux as counted by lines and tokens—the graphed points were nearly identical. He also looked only at the kernel directory to see if the core code showed anything different. Again, he was surprised to see that there was essentially no difference.

He produced "top twenty" lists of committers to both Linux and to just the kernel directory by tokens versus lines and the lists looked much the same. There are some differences and a bit of reordering to be sure, but little that stands out. For Linux, the pre-git commits are 5.15% by token, while only 3.81% by line. Two examples did show some differences, though: Joe Perches made 0.64% of changes when measured by lines (number ten on the list), but did not appear in the top twenty (so less than 0.44%) by tokens; on the flip side, Arend van Spriel was number thirteen by tokens (0.6%) but was not present on the lines list (so less than 0.5%). Results for the top twenty committers in the kernel directory showed much the same for tokens versus lines.

German also reported some overall statistics on tokens in commits, which show that the repository is mostly made up of small changes. For non-merge commits that modified .c or .h files, 9.5% both added and removed three or fewer tokens, 7% only removed tokens, and 3.8% added and removed exactly one token. 22.4% of commits added or removed ten or fewer tokens, and half of all commits added or removed 60 or fewer. On the other hand, two commits added or removed more than one million tokens.

To measure "churn", the researchers subtracted the number of tokens removed in each commit from the number of tokens added. 10% of commits had zero churn, while 48% had a positive churn value of ten or less; 26% had negative churn.
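Given two token views of a file, that churn figure can be computed directly from a token-level diff. Here is a sketch, again using Python's difflib rather than the researchers' (unreleased) tooling:

```python
from difflib import SequenceMatcher

def token_churn(old_tokens, new_tokens):
    """Token churn of a change: tokens added minus tokens removed."""
    added = removed = 0
    sm = SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete"):
            removed += i2 - i1          # tokens gone from the old view
        if op in ("replace", "insert"):
            added += j2 - j1            # tokens new in the new view
    return added - removed

# Hypothetical before/after token views of a small change.
old = ["int", "x", "=", "1", ";"]
new = ["int", "x", "=", "2", ";", "x", "++", ";"]
print(token_churn(old, new))  # → 3 (four tokens added, one removed)
```

A commit that only replaces tokens one-for-one (a rename, say) has zero churn, which is why 10% of kernel commits land in that bucket despite changing code.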

He concluded his presentation by saying that the research had shown that there is not much difference between lines and tokens at the large scale. But on the small scale, doing that analysis can provide a more fine-grained view of the evolution of the code.

In answer to some audience questions, German said that there is no reason this feature could not be built into Git itself if that was desirable. As to the intellectual property and legal ramifications of the work, he was a bit non-committal. The output of this work simply adds more information that the courts can work with when deciding those kinds of cases; it is the job of the courts to find the truth, he said.

The code is not currently available. The team plans to release it as open source in the next three to four months. In the meantime, though, for projects that do not have the scale of the kernel, he is willing to process them and make the repositories available to interested projects.

[I would like to thank the Linux Foundation for travel assistance to attend LinuxCon North America in Toronto.]


Some of the most important discussions associated with the annual Kernel Summit do not happen at the event itself; instead, they unfold prior to the summit on the planning mailing list. There is value in learning what developers feel needs to be talked about and, often, important issues can be resolved before the summit itself takes place. That list has just hosted (indeed, is still hosting as of this writing) a voluminous discussion on license enforcement that was described by some participants as being "pointless" or worse. But that discussion has served a valuable purpose: it has brought to light a debate that has long festered under the surface, and it has clarified where some of the real disagreements lie.

It all started when Karen Sandler, the executive director of the Software Freedom Conservancy (SFC), proposed a session on "GPL defense issues." Interest in these issues is growing, she said, and it would be a good time to get the kernel community together for the purposes of information sharing and determining what community consensus exists, if any, on enforcement issues. It quickly became clear that some real differences of opinion exist, though, in truth, the differences of opinion within the community may not be as large as they seem. Rather than attempt to cover the entire thread, this article will try to extract some of the most significant points from it.

When to sue

Many emails were expended on the question of when — if ever — the GPL should be enforced in the courts. To many, this is the core point, and many think that lawsuits are almost never in the community's interest; instead, compliance issues should be resolved via constructive engagement with the companies involved. Greg Kroah-Hartman made that point clearly when he said "I do [want companies to comply], but I don't ever think that suing them is the right way to do it, given that we have been _very_ successful so far without having to do that." Linus Torvalds stated his agreement with this position (in classic Linus style) as well.

There are, these developers said, plenty of reasons to avoid taking companies to court, starting with the fact that the legal system is nondeterministic at best and one never knows what the end result will be. The recent outcome in the VMware case was given as an example here; some see it as having made it harder to pursue enforcement actions in the future — though others disagree strongly and expect the forthcoming appeal to change the situation. But nobody was willing to argue that one can go to court and be certain of the outcome, and some fear the consequences of a severely adverse ruling.

The stronger objection to legal action, though, is that it forces the targeted company into a defensive mode, isolates developers who support Linux internally, and, likely as not, estranges the company from the development community for a long time. Many developers said, over the course of the discussion, that it is far better to get companies to change their ways through engagement with their developers; the process may take years but, when it ends well, those companies join the community and become enthusiastic contributors. This process is said to have worked many times over the years; some of the kernel's largest contributors were once on the list of GPL infringers.

The BusyBox lawsuits were cited as an example of how legal action can go wrong. The prevailing (though not universal) opinion seems to be that those suits led to the release of little, if any, useful code while driving many companies out of our community and killing the BusyBox project itself. That is the sort of experience that lawsuit-averse kernel developers hope to avoid; as Linus put it: "Lawsuits destroy community. They destroy trust. They would destroy all the goodwill we've built up over the years by being nice."

Implicit in that argument is the claim that license compliance is not a big problem; the current approach is working well and should not be changed. But it is clear that not all developers think that all is well, and there is certainly a contingent that is unwilling to forswear legal action — an option whose abandonment strikes many as simply giving in to the infringers. As David Woodhouse said: "Without the option of a lawsuit, there *is* no negotiation. The GPL has certain requirements and they are backed up by law. If you don't have that, you might as well have a BSD licence." Many participants in the discussion echoed the thought that reluctance to enforce the GPL will lead to its demise in favor of de facto permissive licensing.

Jeremy Allison (an SFC board member), speaking from his experience at the Samba project, argued that lawsuits should remain an option, saying that, despite the fact that the project has never sued an infringing company, the threat of enforcement is the only thing that makes companies listen to him sometimes. He also said that a lawsuit need not necessarily drive a company out of the community; he gave Microsoft, which lost a $1 billion judgment in a suit involving Samba, as an example; despite having been sued successfully, Microsoft is now a significant contributor to Samba.

Supporters of legal action pointed out that, contrary to claims that "we have never had to do that", there have been lawsuits in the past, and the results have not been as bad as some seem to fear. Harald Welte, who has probably done more GPL enforcement in the courts than anybody else, wrote about his experience. He still strongly believes that using the courts can be the right thing to do, and that it need not necessarily push companies out of the community:

The point here is: Legal threats (or sometimes lawsuits) are wake-up calls to companies that _want_ to do the right thing, but who simply haven't been aware of it, or who haven't given it the right attention/priority before. They are *not* upset about enforcement being brought against them.

Companies, he said, appreciate clarity on what compliance with the license actually means, and the only way to get that clarity is through opinions from the courts. He described using the GPL without enforcement as "useless," saying that the BSD license would be a better choice if there is no wish to enforce the terms of the GPL.

In truth, nobody is willing to forswear legal action entirely; even Linus said that there are times when it makes sense. But the example he gave — IBM's use of GPL-infringement charges in the SCO suit — demonstrates that he sees legal action as a move to be made only in extreme situations. In general, one might say that there is a consensus in the community that lawsuits should only be used as the last resort. But there are significant disagreements over when the last resort has been reached.

What is the objective?

Another interesting divide that came to light concerned the purpose of the GPL and the goal of compliance. Linus let it be known that he is primarily concerned with maintaining the flow of code contributions:

What I care about is getting code contributions back. That's kind of the whole *point* of the GPLv2. Not the legalese. Growing the source code base by having participation in the project.

If the goal is to increase contributions, then anything that might alienate contributors is to be avoided. But bringing in the largest amount of code is not the primary concern for everybody involved; some are more focused on gaining access to code that vendors have distributed. Matthew Garrett responded to Linus, saying:

That's what you care about. That's not what your users care about. They care about code *availability*, not contribution. They don't care whether their vendor participates upstream. They just care about being able to fix their shitty broken piece of hardware when the vendor won't ship updates.

A non-confrontational approach can work, he said, when the objective is "4 more enterprise clustering filesystems." But if, instead, one wants the next generation of developers to be able to hack on their devices, then manufacturers have to be convinced to ship source for those devices. Projects like OpenWrt exist as the result of previous enforcement actions, he said; if we want to see similar projects coalesce around today's devices, we need to be prepared to enforce the license and get vendors to provide the source.

Linus feels, though, that OpenWrt has been helped far more by the success of the relaxed "open source" approach, which has focused on producing more and better code, than by the GPL "hardliners" who are focused on software freedom. The latter approach, he claimed, has had the effect of driving developers and companies away from the GPL and has been counterproductive overall.

Who is trusted to make the decision?

A fair amount of energy went into the question of whether the SFC can be trusted as an agent of enforcement for the kernel. Some developers (notably but not exclusively Linus) worry that the SFC is pursuing its own goals and that the kernel is not at the top of its list of priorities. SFC members, and Bradley Kuhn in particular, have made enough comments about needing to litigate the GPL in many courts to obtain precedents, defending the GPL as a moral necessity, and seeing the kernel as the final GPL battleground to make a number of people nervous. So, for example, Ted Ts'o said: "I've asked Bradley point blank whether the GPL or the long-term health of Linux development was more important to him, and he very candidly said 'the GPL'." Greg put it this way:

You value the GPL over Linux, and I value Linux over the GPL. You are willing to risk Linux in order to try to validate the GPL in some manner. I am not willing to risk Linux for anything as foolish as that.

For his part, Bradley did not deny that assertion, but qualified it somewhat:

You said that you "care more about Linux than the GPL". I would probably agree with that. But, I do care about software freedom generally much more than I care about Linux *or* the GPL. I care about Linux because it's the only kernel in the world that brings software freedom to lots of users.

Bradley and the SFC had many defenders in the discussion, who said that the SFC should be judged by its actions rather than by what Bradley has said. They point out that, rather than being a litigious group, the SFC has only been involved in two lawsuits ever. They said that the SFC is willing to take on the unpleasant task of talking to management and legal departments about compliance issues — something that developers are generally unwilling to do. And, as Jeremy emphasized, the SFC is not pursuing its own agenda, but that of the developers it represents:

The other thing to remember is Bradley isn't the one making the decisions here. It's the developers - *ALWAYS* the developers. Bradley and the Conservancy staff can give advice, but they do not do *anything* without a direct mandate from the copyright holders.

Such testimonials notwithstanding, it is clear that a number of developers feel that the SFC's objectives do not necessarily line up with their own. Getting over that trust barrier could be hard for the organization to do. Karen has said that the SFC will be having meetings with developers over the coming six months in order to answer questions and get feedback on its enforcement activities. It will be interesting to see what sort of changes happen — both within and outside of the SFC — as the result of these meetings.

The current discussion on the list has slowed, which is undoubtedly a relief to everybody involved. There may not have been much that was resolved, but there should, at a minimum, be a better mutual understanding of the issues and concerns involved with GPL enforcement. The area is complex and full of risks — risks that are associated with both action and inaction. Figuring out what the community wants to do about GPL infringement will, if it is possible at all, require more discussions like this one. The prospect may be painful, but the possibility of frustrated developers acting rashly on their own could be even more so.
