The Cost of Going it Alone

9:38 pm

Dave Neary

These are speaker notes from a presentation I gave at the Desktop Summit 2011, on a topic I’ve written about in the past. The slides for the presentation are on the conference website (PDF link).

I’m going to talk about the costs associated with modifying and maintaining free software “out of tree” – that is, when you don’t work with the developers of the software to have your changes integrated. But I’m also going to talk about the costs of working with upstream projects. It can be easy for us to forget that working upstream takes time and money – and we ignore that to our peril. It’s in our interests as free software developers to make it as cost-effective as possible for people to work with us.

Hopefully, if you’re a commercial developer, you’ll come away from this article with a better idea of when it’s worthwhile to work upstream, and when it isn’t. And if you’re a community developer, perhaps this will give you some ideas about how to make it easier for people to work with you.

Softway

In 1996 and 1997, Softway worked on bringing a POSIX API to Windows NT. This involved major patches to all the components of the GCC toolchain – the compiler, linker, assembler, debugger, etc. To make the changes they needed, they hired an 18 year industry veteran of compilers and operating systems, and over the course of 6-8 months, the main body of work was done.

Conscious of the costs involved in maintaining that much work out-of-tree, Softway approached upstream developers to propose that their changes be integrated. The reactions ranged from “this is great, but…” to “NT? Don’t care about it”.

After this initial failure, Softway turned to the GCC company at the time: Cygnus Solutions. Cygnus had hired many GCC maintainers, and at that time if you wanted anything done with GCC, they were the guys to talk to. Their quote? $140,000. And they wouldn’t be able to start work on the project for 14 months.

Deciding this was too expensive, Softway eventually hired another company called Ada Core, maintainers of the Ada front-end for GCC, to rework the patches and get them upstream. Ada Core cost $40,000, and could start next week. That’s roughly the same amount of money as was originally spent developing the features in question.

Getting Things Done

At the highest level, the question that people want to answer is: “How can I get what I want done, as quickly and cheaply as possible?” This software exists, it does 80% of what I need, I need to change it a little bit to fit those needs. What’s the best way for me to do that?

The most common strategy is to pick a release on which to build on, and hack away from there. In fact, in probably 90% of cases that’s as far as it goes. I add the features I need for the project I’m working on Right Now, ship it, and never even talk to upstream about what I did, or why.

The costs involved in this approach are all the “stuff” that gets added upstream (features, bug fixes and security patches) which you never see. Mal Minhas, former CTO of the LiMo Foundatioon, labelled this “unleveraged potential” (PDF link) – missing out on the work of other people who are doing things in your interests. The underlying assumption is that you will end up redoing at least some of this work during your maintenance of your own work.

To avoid missing out on this work, it’s recommended to merge regularly changes from upstream into your local copy of the upstream package. But this merge is typically not free – and the bigger your changes, the more likely it is you will find significant conflicts between upstream and your local copy. There is also an additional, often-forgotten overhead involved with regression testing and validation post-merge. Every time I upgrade a component to a new version, I need to verify that it hasn’t broken anything, either because of changed behaviour at integration points I’m using or simply because some regressions were introduced when fixing other issues.

And the worst part about this maintenance cost is that you will incur it every time you upgrade. Almost every time you pull code from upstream, you will have substantial costs involved in merge resolution and regression testing. And to keep the delta between your code and upstream as small as possible, you should do these merges regularly.

Inevitably, someone will suggest that the maintenance costs have grown to the point where it’s worth your while “giving back” (where this is often a synonym for “dumping our stuff upstream”). The goal is to reduce the delta to a point where only client- or project-specific patches are maintained out of tree, and anything which might be useful to someone else is sent back to the upstream project, to be maintained there. Jim Zemlin recently said in an interview “Let me tell you, maintaining your own version of Linux ain’t cheap, and it ain’t easy”… in his words, “if some aren’t giving back as much as others today, I just think it will naturally happen over time. It always is in their business interest to do so”.

It’s at this point that you will run into the community overhead.

Community overhead

Open Source developers expect contributors to jump through all sorts of hoops to get their code upstream. Most maintainers will request that you re-format patches according to community norms, submit patches which apply cleanly to the head of the development branch, and may suggest alternative approaches to how you should write your patch or feature. Jonathan Corbet has described how this works with the kernel community:

A patch of any significance will result in a number of comments from other developers as they review the code. Working with reviewers can be, for many developers, the most intimidating part of the kernel development process […] when reviewers send you comments, you need to pay attention to the technical observations that they are making. Do not let their form of expression or your own pride keep that from happening. When you get review comments on a patch, take the time to understand what the reviewer is trying to say. If possible, fix the things that the reviewer is asking you to fix. And respond back to the reviewer: thank them, and describe how you will answer their questions.

To take just one example of how projects expect people to do extra work to contribute, think about what version of a piece of software a company is likely to want to integrate into their product or solution. When we worked on QuteCom, the rule I expected our developers to follow was to use only stable releases of any libraries we depended on, which were included in recent releases of major distributions, and were present in Debian testing. I didn’t want my guys to be debugging unstable versions of software written by other people, we had our own stuff to get out the door. And by requiring that components be present in releases of popular distros, I felt I was lowering the bar to participation, by allowing community members to get started by installing devel packages from the distro, rather than downloading and compiling dependencies.

So what happens when you make some changes to the projects you depend on? Those changes are made to a stable version of the software. If your product release will take several months, it is likely that upstream will have moved on. And we expect patches against development branches, not against stable releases. So any work that I do needs to be merged first with the development branch, before it’s submitted.

A developer is left with some choices, none without costs or risk:

Ignore upstream completely, scratch your own itch, and sell upgrades to your end client to pay for maintenance costs – which wastes developer time and does not benefit upstream at all, not to mention leaving your software open to security issues and bugs as they are discovered and fixed upstream.

Fork a vendor branch off a stable release, and “when the project is finished” work on merging it upstream – but by that time, other projects have come along, there’s a substantial cost involved in rebasing your work to the latest and greatest upstream – “when I have time” doesn’t happen very often in the Real World. One way to get around this is to hire someone from upstream to “take care” of getting your code upstream – as we have seen with Softway, this can be a significant financial and time investment. But it can almost guarantee that your code will get upstream.

guarantee that your code will get upstream. The ideal situation – work on a feature branch on the development branch of upstream, and back-port your work to the slow-moving stable version, submitting it for inclusion upstream as soon as it’s finished. Martin Fowler recently pointed out some of the problems with feature branches, but they’re much better than big merges and code dumps. The cost here is that you’re paying up-front the merge cost by making small frequent merges, and adding an extra overhead keeping two branches in sync. Also, there is a significant cost involved in the risk that once you’ve put in all the work, the result will be rejected by upstream developers anyway.

Some stories can illustrate the relative weight of the costs involved in the second and third of these options.

Maemo GTK+

In 2003 or 2004, Nokia started working on modifications to GTK+ for its Nokia 770 tablet. By the time the tablet was released in 2005, the Nokia delta to GTK+ was tens of thousands of lines of code. In addition, a set of mobile-only widgets had been packaged into the Hildon package, which depended on Nokia’s vendor version of GTK+. At that point, when the project became public, Nokia wanted to propose these changes for inclusion upstream in GTK+.

To help with this work, Nokia contracted Imendio (now called Lanedo) to help. At the time they started working on the project, the delta was over 50,000 lines of code. Over the course of 4 years, a lot of work was done to reduce this delta, sometimes by re-writing Maemo features to make them acceptable upstream, sometimes by shepherding changes in, and in part by rebasing Maemo’s GTK+ on GTK+ 2.10, which solved some of the problems which needed to be addressed before.

And yet, even after four calendar years and many more man-years of effort, Hildon, the (reduced) set of mobile widgets which many Maemo application developers used, did not work perfectly on top of a stock upstream GTK+. When Nokia made a grant of $50,000 to the GNOME Foundation to be spent to enhance the developer experience for MeeGo developers, integrating Hildon widgets upstream and ensuring that Maemo application developers could easily port their applications to MeeGo was a big part of the winning bid.

A huge amount of this work could have been avoided by following Andrew Morton’s advice, given to a developer of the experimental filesystem Tux3:

Do NOT fall into the trap of adding more and more stuff to an out-of-tree project. It just makes it harder and harder to get it merged.There are many examples of this.

But to ask the question the other way around: could the GTK+ maintainers have done more to facilitate the submission of this work upstream?

Wakelocks

In 2005, Google acquired a little-known, stealth mode company called Android, which we now know was developing a Java- and Linux-based phone operating system. By 2007, when the platform and first Android phones were announced, the company had made significant changes to the Linux kernel for its needs. One of those changes was called wakelocks – system services and kernel drivers could request that the kernel not go to sleep in the absence of user input (thus locking the device awake, giving the name). Matthew Garrett gave a pretty clear description of what wakelocks do, and why they were added to Android (from his point of view as maintainer of power management in the kernel) at last year’s LinuxCon.

Wakelocks allow the system to save battery, even when there are poorly behaved applications running on the system. For a production environment with thousands of applications of varying quality, that makes sense. So in early 2009, a little known kernel developer working on Android, Arve Hjønnevåg, proposed that wavelocks be included in the kernel. To that end, he sent the following mail, along with a big patch, to the Linux kernel mailing list:

The following patch series adds two apis, wakelock and earlysuspend. The Android platform uses the earlysuspend api to turn the screen and some input devices on and off. The wakelock code determines when to enter the full suspend state. These apis could also be useful to other platforms where the goal is to enter full suspend whenever possible.

It’s fair to say that these changes were not initially well understood or well received. The initial reaction was covered at the time by Jonathan Corbet of LWN.

After a few rounds of revisions, the proposal appears to have been dropped. At least, no significant efforts were made to have the patches proposed from March 2009 until early 2010, when a number of things happened that converged to a perfect storm around the issue. At the end of 2010, Greg Kroah-Hartman deleted a number of Android drivers from the staging tree, saying “no one cared about the code”. Partly as a result of that, a number of key kernel figures met in April 2010 at the collaboration summit to discuss “the Android problem” with Google engineers. Soon afterwards (perhaps by coincidence), Arve re-posted a new set of patches, which seemed to be well received.

That thread erupted, however, and 1500 emails and some hurt feelings later, an alternate implementation called suspend blockers by Rafael Wysocki was integrated into the kernel.

All of this took a lot of work from Google, essentially after the feature was finished. According to Ted Ts’o, speaking in August 2010:

Android team members have spent literally hundreds of man hours (my mail folder on the suspend blocker thread has over 1500 mail messages, and is nearly 10MB), and have tried rewriting the patches several times, in an attempt to make them be main-line acceptable.

Chris DiBona, speaking in an interview with Paula Rooney, said that at that time that “there were some developers at Google working on it who “feel burned” by the decision but he acknowledged that the “staffing, attitude and culture” at Google isn’t sufficient to support the kernel crew.”

Clearly, there is a cost involved in trying to submit code to the kernel and to other projects, and there is a significant risk that the code will be rejected, in spite of that effort.

Given the heavy-hitting kernel developers working inside Google, I can’t help but wonder whether things would have gone smoother if Andrew Morton were asked to help shepherd wakelock functionality into the kernel. As Matthew Garrett said in his LinuxCon post-mortem: “Getting code into the kernel is always easier if you have a recognised name associated with it”.

EVMS and IBM

Even when you do everything right, it is possible to have code rejected by upstream – at which point, the best solution is often to suck it up, and port your work over to whatever API was provided for the same problem space by upstream.

In 2000 or 2002, IBM started work on EVMS, the Enterprise Volume Management System/ EVMS was a set of kernel drivers and user space tools which allowed users to manage several virtual disk drives, potentially across several disks and physical partitions. Dan Frye, speaking about the project in August 2002, said it was “a quantum step forward in terms of ease of use, reliability and performance”.

The project was published first on SourceForge in 2001, and was developed in the open, with releases regularly being made against the most recent upstream kernels, as well as for the major distributions.

On October 2nd, 2002, Kevin Corry proposed EVMS for inclusion in the 2.5 kernel. At the time, he wrote:

To make this as simple as possible for you, there is a Bitkeeper tree available with the latest EVMS source code, located at: http://evms.bkbits.net/linux-2.5 This tree is sync'd with the linux-2.5 tree on linux.bkbits.net as of about noon today (Oct 2).

At this point, IBM had done everything right, as far as I can tell. They tracked upstream development, talked about their plans and encouraged feedback, and when they proposed EVMS for inclusion, it was intended to make it as easy as possible to accept it.

At this point, things get a little fuzzy. The end result, though, was that LVM2, an alternative kernel framework for logical volume management developed by a small company called Sistina (later acquired by Red Hat), was merged on October 29th. And on November 5th, Kevin Corry announced a change in direction for EVMS:

As many of you may know by now, the 2.5 kernel feature freeze has come and gone, and it seems clear that the EVMS kernel driver is not going to be included. With this in mind, we have decided to rework the EVMS user-space administration tools (the Engine) to work with existing drivers currently in the kernel, including (but not necessarily limited to) device mapper and MD.

Again, IBM did “the Right Thing”, and ported their tools to the APIs which were in the kernel. Making that decision earned the team a lot of respect among the kernel community. But at the end of the day, it also cost them – all of the wasted work on their kernel drivers, and all of the work porting their tools to new APIs. A back of the envelope calculation, based on the releases of EVMS after that date, suggests that it set the project back ~6 months, and ~18 man-months of work.

When to engage?

So when does it become useful to engage upstream? And, more importantly, is there anything that we, as upstreams, can do to lower that barrier, to make it easier for new companies to engage?

The reality is, if you’re only making small patches to a project, it’s going to take you longer to get your patches upstream than it will to maintain them over time. It’s just not worth your while, unless there is some intrinsic motivation for working with the project.

If you have a moderately sized delta with upstream, and you expect to have to maintain your version over time, then it may be time to start thinking about getting some of that code upstream. There are a couple of approaches you can use: The first is to train up developers working in-house and have them build some relationship capital over time. The second is to hire a company who already employs maintainers to review your code, suggest changes and help shepherd anything that is appropriate upstream, as Softway did with Ada Core. The relative costs of these options will depend on how nice the project is to new developers, how well documented the community processes are, and of course the quality of your code. In any case, at this point you will need to consider the cost involved in rebasing to a later version of upstream, and consider how often you will have to do it.

However, once your delta goes over a certain threshold, you will end up spending a substantial amount of time in maintenance and regression testing. At this point, the repeated long-term cost of conflict resolution and regression testing every time you merge will outweigh the cost of getting code upstream. Again, there are two options – train up one of your developers, or subcontract the upstreaming work. You might even consider killing two birds with one stone, and including mentorship of one or two of your developers in the subcontracting contract, to grow some in-house project developers. It may also be worth head-hunting someone with an existing reputation in the project.

Finally, if you are using a piece of software as a strategic plan of a product portfolio, and you are making substantial changes to it, then you would be insane not to have a maintainer, or at least a senior developer, from the project on payroll. When Samsung decided to include EFL in their phone platform, they hired Rasterman. Google hired Andrew Morton. Red Hat hired many of the GTK+ maintainers of the era when they created Red Hat Advanced Development Labs to develop GNOME back in 2000. Collabora hired Wim Taymans, Edward Hervey and Christian Schaller from the GStreamer project. And the list continues. But if you do hire a maintainer, do so in the knowledge that their primary allegiance may be to “their” project, and not to your company. Being a good maintainer entails doing a lot more than writing code – factor into your schedule that 20% – 30% of their time will be spent on patch review, rolling releases, documenting roadmap plans, chatting on the mailing list, etc.

Let me leave you with this take-away: We will never make it financially interesting for someone who’s done a quick hack to get that upstream, or get them to fix their “bug” the Right Way. It will always be in the interests of companies with a strategic dependency on a piece of software to hire or train someone to be a core member of the community developing that software. But for everything in between, as free software developers, we can help. We can make it easier for companies to figure out who they can hire to train or mentor their developers, or shepherd changes upstream. We can lower the bar to getting features upstream by documenting community norms and practices, by being nicer to new developers on the list, and by instituting a mentoring programme. By improving patch review processes, we can decrease friction and make it nicer and easier to contribute to the project. If it’s easier for a developer to do a merge every three months than it is for him to talk to you, then your project is missing out.