The Linux Kernel: It’s Worth More!

David A. Wheeler

This paper refines Ingo Molnar’s estimate of the development effort it would take to redevelop Linux kernel version 2.6. Molnar’s rough estimate found it would cost $176M (US) to redevelop the Linux kernel using traditional proprietary approaches. By using a more detailed cost model and much more information about the Linux kernel, I found that the effort would be closer to $612M (US) to redevelop the Linux kernel as it existed in 2004. A postscript lists some recalculations since then, showing that these values have grown. In any case, the Linux kernel is clearly worth far more than the $50,000 offered in 2004.

On October 7, 2004, Jeff V. Merkey made the following offer on the linux.kernel mailing list:

We offer to kernel.org the sum of $50,000.00 US for a one time license to the Linux Kernel Source for a single snapshot of a single Linux version by release number. This offer must be accepted by **ALL** copyright holders and this snapshot will subsequently convert the GPL license into a BSD style license for the code.

Many respondents noted that this proposal was unworkable, because it required complete agreement by all copyright holders. Not only would such a process be lengthy, but many copyright holders made it clear in various replies that they would not agree to any such plan. Many Linux kernel developers expect improved versions of their code to be continuously available to them, and a release using a BSD-style license would violate those developers’ expectations. Indeed, it was clear that many respondents felt that such a move would strip the Linux kernel of legal protections against someone who wanted to monopolize a derived version of the kernel. Many open source software / Free software (OSS/FS) developers allow conversion of their OSS/FS programs to proprietary programs; some even encourage it. The BSD-style licenses are specifically designed to allow conversion of an OSS/FS program into a proprietary program. However, the GPL is the most popular OSS/FS license, and it was specifically designed to prevent this. Based on the thread responses, it’s clear that many Linux kernel developers prefer that the GPL continue to be used as the Linux kernel license.

In addition, many people were suspicious about the motives for this offer. Groklaw published an article that mentioned this proposal, and noted that someone with the same name is listed on a patent recently obtained by the Canopy Group. SCO is a Canopy Group company, and I have since confirmed that the patent application refers to the same person. Groklaw later tried to learn more about him. I don’t really know why Merkey made this proposal, and it doesn’t really matter. What’s more interesting to me is the questions that this raised, namely, how much is Linux “worth”? That is a valid question!

In one of the responses, Ingo Molnar calculated the cost to re-develop the Linux kernel using my tool SLOCCount. Molnar didn’t specify exactly which version of the Linux kernel he used, but he did note that it was in the version 2.6 line, and presumably it was a recent version as of October 2004. He found that “the Linux 2.6 kernel, if developed from scratch as commercial software, takes at least this much effort under the default COCOMO model”:

Total Physical Source Lines of Code (SLOC)                = 4,287,449
Development Effort Estimate, Person-Years (Person-Months) = 1,302.68 (15,632)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 8.17 (98.10)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 159.35
Total Estimated Cost to Develop                           = $175,974,824
 (average salary = $56,286/year, overhead = 2.40)
SLOCCount is Open Source Software/Free Software, licensed under the FSF GPL. Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

After noting the redevelopment cost of $176M (US), Ingo Molnar then commented, “and you want an unlimited license for $0.05M? What is this, the latest variant of the Nigerian/419 scam?”
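Molnar’s figures are easy to reproduce mechanically. Here is a short Python sketch (my own illustration, not part of SLOCCount; the size, formulas, salary, and overhead all come from the SLOCCount output quoted above) that recomputes the Basic COCOMO estimate:

```python
# Reproduce the Basic COCOMO estimate from the quoted SLOCCount output.
sloc = 4_287_449                       # physical SLOC of the 2.6 kernel
ksloc = sloc / 1000.0

person_months = 2.4 * ksloc ** 1.05    # Basic COCOMO effort equation
schedule_months = 2.5 * person_months ** 0.38
developers = person_months / schedule_months

salary = 56286                         # average annual salary (US$)
overhead = 2.4                         # overhead multiplier
cost = (person_months / 12) * salary * overhead

print(f"Effort:   {person_months:,.0f} person-months "
      f"({person_months / 12:,.2f} person-years)")
print(f"Schedule: {schedule_months:.2f} months")
print(f"Staff:    {developers:.2f} developers")
print(f"Cost:     ${cost:,.0f}")
```

Running this yields roughly 15,632 person-months and about $176M, matching the output above to rounding.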

Strictly speaking, the value of a product isn’t the same as the cost of developing it. For example, if no one wants to use a software product, then it has no value, no matter how much was spent in developing it. The value of a proprietary software product to its vendor can be estimated by computing the amount of money that the vendor will receive from it over all future time (via sales, etc.), minus the costs (development, sustainment, etc.) over that same time period -- but predicting the future is extremely difficult, and the Linux kernel isn’t a proprietary product anyway. Value to users is surprisingly difficult to compute directly. But if a software product is used so widely that you’d be willing to redevelop it, then development costs are a reasonable way to estimate a lower bound of its value: if you’re willing to redevelop a program, then it must be worth at least that much to you. The Linux kernel is widely used, so its redevelopment cost gives at least a lower bound of its value.

Thus, Molnar’s response is quite correct -- offering $50K for something that would cost about $176M to redevelop is ludicrous. It’s true that the kernel developers could continue to develop the Linux kernel after a BSD-style release; after all, the *BSD operating systems do this now. But with a BSD-style release, someone else could take the code and establish a competing proprietary product, and it would take time for the kernel developers to add enough additional material to compete with such a product. It’s not clear that a proprietary vendor could really pick up the Linux kernel and maintain the same pace without many of the original developers, but that’s a different matter. Certainly, the scale of the difference between $176M and $50K is enough to show that the offer is trivial compared to what the offerer is trying to buy.

But in fact, it’s even sillier than it appears; I believe the cost to redevelop the Linux kernel would actually be much greater than this. Molnar correctly notes that he used the default Basic COCOMO model for cost estimation. This is the default cost model for SLOCCount, because it’s a reasonable model for rough estimates about typical applications. It’s also a reasonable default when you’re examining a large set of software programs at once, since the ranges of real efforts should eventually average out (this is the approach I used in my More than a Gigabuck paper). So, what Molnar did was perfectly reasonable for getting a rough order of magnitude of effort.

But since there’s only one program being considered in this analysis -- the Linux kernel -- we can use a more detailed model to get a more accurate cost estimate. I was curious what the answer would be. So I’ve estimated the effort to create the Linux kernel, using a more detailed cost model. This paper shows the results -- and it shows that redeveloping the Linux kernel would cost even more.

This estimate is what it would cost to rebuild a particular version, and not exactly the same as the effort actually invested into the kernel. In particular, in Linux kernel development, a common practice is to have a “bake-off” where competing ideas are all implemented and then measured; the approach with the best result (e.g., faster) is then used. Bake-offs have much to commend them, but since only one approach is actually included, the effort invested in the alternatives isn’t included in this estimate.

To get better accuracy in our estimation, we need to use a more detailed estimation model. An obvious alternative, and the one I’ll use, is the Intermediate COCOMO model. This model requires more information than the Basic COCOMO model, but it can produce higher-accuracy estimations if you can provide the data it needs. We’ll also use the version of COCOMO that uses physical SLOC (since we don’t have the logical SLOC counts). If you don’t want to know the details, feel free to skip to the next section labelled “results”.

First, we need to determine whether this is an “organic”, “embedded”, or “semidetached” application. The Linux kernel is clearly not an organic application; organic applications have a small software team developing software in a familiar, in-house environment, without significant communication overheads, and allow hard requirements to be negotiated away. It could be argued that the Linux kernel is embedded, since it often operates in tight constraints; but in practice these constraints aren’t very tight, and the kernel project can often negotiate requirements to a limited extent (e.g., providing only partial support for a particular peripheral or motherboard if key documentation is lacking). While the Linux kernel developers don’t ignore resource constraints, there are no specific constraints that the developers feel are strictly required. Thus, it appears that the kernel should be considered a “semidetached” system; this is the intermediate stage between organic and embedded. “Semidetached” isn’t a very descriptive word, but that’s the word used by the cost model, so we’ll use it here. It really just means between the two extremes of organic and embedded.

The intermediate COCOMO model also requires a number of additional parameters. Here are those parameters, and their values for the Linux kernel (as I perceive them); the parameter values are based on Software Engineering Economics by Barry Boehm:

RELY: Required software reliability: High (1.15). The Linux kernel is now used in situations where crashes can cause high financial loss. Even more importantly, Linux kernel developers expect the kernel to be highly reliable, and the kernel undergoes extensive worldwide off-nominal testing. While the testing approach is different than traditional testing regimes, it clearly produces a highly reliable result (see the Reliability section of my paper Why OSS/FS? Look at the Numbers!).

DATA: Data base size: Nominal (1.0). Typically the Linux kernel manages far larger data bases (file systems) than itself, but it handles them as somewhat opaque contents, so it’s questionable that those larger sizes can really be counted as being much greater than nominal. Handling the filesystems’ metadata is itself somewhat complicated, and does take significant effort, but filesystem management is only one of many things that the kernel does. So, absent more specific data, we’ll claim it’s nominal. If we claim it’s higher, and there’s reason for doing so, that would increase the estimated effort.

CPLX: Product complexity: Extra high (1.65). The kernel must perform multiple resource handling with dynamically changing priorities: multiple processes/tasks running on potentially multiple processors, with multiple kinds of memory, accessing peripherals which also have various dynamic priorities. The kernel must deal with device timing-dependent coding, and with highly coupled dynamic data structures (some of whose structure is imposed by hardware). In addition, it implements routines for interrupt servicing and masking, as well as multi-processor threading and load balancing. The kernel does have an internal design structure, which helps manage complexity somewhat, but in the end no design can eliminate the essential complexity of the task today’s kernels are asked to perform. It’s true that toy kernels aren’t as complex; requiring single processors, forbidding re-entry, ignoring resource contention issues, ignoring error conditions, and a variety of other simplifications can make a kernel much easier to build, at the cost of poor performance. But the Linux kernel is no toy. Real-world operating system kernels are considered extremely difficult to develop, for a litany of good reasons.

TIME: Execution time constraint: High (1.11). Although it doesn’t need to stay at less than 70% resource use, performance is an important design criterion, and much effort has been spent on measuring and improving performance.

STOR: Main storage constraint: Nominal (1.0). Although there has been some effort to limit memory use (e.g., 4K kernel stacks), Linux kernel development has not been strongly constrained by memory.

VIRT: Virtual machine volatility: High (1.15). The most common processor (x86) doesn’t change that quickly, though new releases by Intel and AMD do need to be taken into account. The Linux kernel is also influenced by other processor architectures, which in the aggregate change quite a bit over time. Even more importantly, the other components of underlying machines (such as motherboards, peripheral and bus interfaces, etc.) change on a weekly basis. Often the documentation is unavailable, and when available, it’s sometimes wrong (which from a developer’s point of view looks like a volatile interface, since it keeps changing). The Linux kernel developers spend a vast amount of time identifying hardware limitations/problems and working around them. What’s worse, there’s a variety of different hardware, and new ones keep arriving. The kernel developers do attempt to control things where they can. For example, while they try to write code that works with a variety of gcc versions, they limit themselves to one compiler (gcc), designate an official gcc version, and try to limit when official gcc versions are changed. But these measures cannot hide the fact that the interface of the underlying machine is actually quite volatile.

TURN: Computer turnaround time: Nominal (1.0). Kernel recompilation and rebooting aren’t interactive, but they’re reasonably fast on 2+ GHz processors. Once the first compilation has occurred, recompilation is usually quite quick for localized changes. Thus, there’s no reason for this to be a penalty.

ACAP: Analyst capability: High (0.86). It appears that the people analyzing the system, identifying the “real” requirements, and the needed design modifications to support them, are significantly better at doing this than the industry average. This analysis tends to be more distributed than in a typical proprietary project, but it obviously still occurs.

AEXP: Applications experience: Nominal (1.0). It’s difficult to determine how much experience with the Linux kernel the software developers of the Linux kernel have. Clearly, if you modify the same program day after day for many years, you’ll tend to become more efficient at modifying it. Some developers, such as Linus Torvalds and Alan Cox, clearly have a vast amount of experience in modifying the Linux kernel. But for many other kernel developers it isn’t clear that they have a vast amount of experience modifying the Linux kernel. In absence of better information, I’ve chosen nominal. This suggests that on average, developers of the Linux kernel have about 3 years’ full-time experience in modifying the Linux kernel. More experience on average would help, and lower the effort estimation somewhat.

PCAP: Programmer capability: High (0.86). Modern kernels such as Linux are complex, creating a strong barrier against attempts to contribute by less capable developers. Would-be contributors must convince the existing experts that their work is worthwhile, so new contributors’ works are normally revised by highly capable developers. Key kernel developers are not accepted as such unless they convince the other, already highly capable developers that they are also capable. Generally only highly capable, above-average developers (75th percentile or more) will be successful at helping to develop the Linux kernel.

VEXP: Virtual machine experience: Nominal (1.0). The x86 processors, which are by far the most popular for the Linux kernel, are relatively stable, and kernel developers have a lot of experience with them. But they are not completely stable (e.g., the new 64-bit extensions for x86 and the NX bit), which can also reduce experience slightly. Authors of ports to other processors also tend to be experienced with those processors. On the other hand, most of the kernel’s code is in its hardware drivers, and this hardware often acts as a virtual machine as well as a needed interface. Many driver developers, while experienced in general, often have less experience with the particular component they’re writing a driver for. In particular, many drivers are not written by the companies that produce the hardware, and the developers often don’t have good documentation to help them. Sometimes this has helpful side-effects. It can help unify how hardware is handled, since kernel developers who are writing drivers for several similar peripherals will often develop a way to unify their handling and apparent interface. It can also aid reliability in the long term, since the driver writers understand how the kernel works (Windows drivers tend to be written by hardware companies who understand their product but have less knowledge about Windows, and since their code is often not peer-reviewed by Windows developers, many Windows drivers can cause the entire operating system to crash). But this initial lack of information by Linux kernel developers about the components does increase the effort to develop a driver. What’s worse, hardware components are notorious for not operating as their specifications proclaim, and the kernel’s job is to hide all that. Thus, this is averaged as nominal, and this is probably being generous.

LEXP: Programming language experience: High (0.95).

MODP: Modern programming practices: High - in general use (0.91). This program is written in C, which lacks structures such as exception handling, so there is extensive use of “goto” (etc.) to implement error handling. However, the use of such constructs tends to be highly stylized and structured, so credit is given for using modern practices. Some might claim that this is giving too much credit, but changing this would only make the estimated effort even larger.

TOOL: Use of software tools: Nominal (1.0).

SCED: Required development schedule: Nominal (1.0). There is little schedule pressure per se, so the “most natural” speed is followed.

So now we can compute a new estimate for how much effort it would take to re-develop the Linux kernel 2.6:

MM-nominal-semidetached = 3*(KSLOC)^1.12
                        = 3*(4287.449)^1.12 = 35,090 MM
Effort-adjustment = 1.15 * 1.0 * 1.65 * 1.11 * 1.0 * 1.15 * 1.0 * 0.86
                    * 1.0 * 0.86 * 1.0 * 0.95 * 0.91 * 1.0 * 1.0 = 1.54869
MM-adjusted = 35,090 * 1.54869 = 54,343.6 Man-Months
            = 4,528.6 Man-Years of effort to (re)develop
If average salary = $56,286/year, and overhead = 2.40, then:
Development cost = 56286 * 2.4 * 4528.6 = $611,757,037

In short, it would actually cost about $612 million (US) to re-develop the Linux kernel.
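The arithmetic above can be checked mechanically. The following Python sketch (my own illustration; the multipliers, nominal equation, salary, and overhead are exactly those given in the text) applies the Intermediate COCOMO semidetached equation:

```python
# Intermediate COCOMO, semidetached mode, with the cost drivers chosen above.
ksloc = 4287.449

# Effort multipliers, in order: RELY, DATA, CPLX, TIME, STOR, VIRT, TURN,
# ACAP, AEXP, PCAP, VEXP, LEXP, MODP, TOOL, SCED
multipliers = [1.15, 1.0, 1.65, 1.11, 1.0, 1.15, 1.0,
               0.86, 1.0, 0.86, 1.0, 0.95, 0.91, 1.0, 1.0]

eaf = 1.0                               # effort-adjustment factor
for m in multipliers:
    eaf *= m

mm_nominal = 3.0 * ksloc ** 1.12        # nominal person-months (semidetached)
mm_adjusted = mm_nominal * eaf
person_years = mm_adjusted / 12

salary = 56286
overhead = 2.4
cost = person_years * salary * overhead

print(f"EAF:    {eaf:.5f}")
print(f"Effort: {mm_adjusted:,.1f} person-months "
      f"({person_years:,.1f} person-years)")
print(f"Cost:   ${cost:,.0f}")
```

This reproduces the effort-adjustment factor of 1.54869 and a total cost of roughly $612M, matching the calculation above to rounding.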

Why is this estimate so much larger than Molnar’s original estimate? The answer is that SLOCCount presumes that it’s dealing with an “average” piece of software (i.e., a typical application) unless it’s given parameters that tell it otherwise. This is usually a reasonable default, but almost nothing is as hard to develop as an operating system kernel. Operating system kernels are so much harder to develop that, if you include that difficulty in the calculation, the effort estimates go way up. This difficulty shows up in the nominal equation: semidetached projects are fundamentally harder, and thus have a larger exponent in their estimation equation than the default for Basic COCOMO. It also shows up in factors such as “complexity”; the task the kernel does is fundamentally hard. The strong capabilities of analysts and developers, use of modern practices, and programming language experience all help, but they can only partly compensate; it’s still very hard to develop a modern operating system kernel.

This difference is smoothed over in my paper More than a Gigabuck because that paper includes a large number of applications. Some of the applications would cost less than was estimated, while others would cost more; in general you’d expect that by computing the costs over many programs the differences would average out. Providing that sort of information for every program would have been too time-consuming for the limited time I had available to write that paper, and I often didn’t have that much information anyway. If I do such a study again, I might treat the kernel specially, since the kernel’s size and complexity make it reasonable to treat specially. SLOCCount actually has options that allow you to provide the parameters for more accurate estimates, if you have the information they need and you’re willing to take the time to provide them. Since the nominal factor is 3, the adjustment for this situation is 1.54869, and the exponent for semidetached projects is 1.12, just providing SLOCCount with the option “--effort 4.646 1.12” would have created a more accurate estimate. But as you can see, it takes much more work to use this more detailed estimation model, which is why many people don’t do it. For many situations, a rough estimate is really all you need; Molnar certainly didn’t need a more exact estimate to make his point. And being able to give a rough estimate when given little information is quite useful.

In the end, Ingo Molnar’s response is still exactly correct. Offering $50K for something that would cost millions to redevelop, and is actively used and supported, is absurd.

It’s interesting to note that there are already several kernels with BSD licenses: the *BSDs (particularly FreeBSD, OpenBSD, and NetBSD). These are fine operating systems for many purposes; indeed, my website once ran on OpenBSD. But clearly, if there is a monetary offer to buy Linux code, the Linux kernel developers must be doing something right. Certainly, from a market share perspective, Linux-based systems are far more popular than systems based on the *BSD kernels. If you just want a kernel licensed under a BSD-style license, you know where to find them.

It’s worth noting that these approaches only estimate development cost, not value. All proprietary developers invest in development with the presumption that the value of the resulting product (as captured from license fees, support fees, etc.) will exceed the development cost -- if not, they’re out of business. Thus, since the Linux kernel is being actively sustained, it’s only reasonable to presume that its value far exceeds this simple redevelopment-cost estimate.

It’s also worth noting that the Linux kernel has grown substantially. That’s not surprising, given the explosion in the number of peripherals and situations that it supports. In Estimating Linux’s size, I used a Linux distribution released in March 2000, and found that the Linux kernel had 1,526,722 physical source lines of code. In More than a Gigabuck, the Linux distribution had been released in April 2001, and its kernel (version 2.4.2) was 2,437,470 physical source lines of code (SLOC). At that point, this Linux distribution would have cost more than $1 Billion (a Gigabuck) to redevelop. The much newer and larger Linux kernel considered here, with far more drivers and capabilities than the one in that paper, now has 4,287,449 physical source lines of code, and is starting to approach a Gigabuck of effort all by itself. If the kernel reaches 6,648,956 lines of code (($1E9/$56286/2.4*12/3/1.54869) ^ (1/1.12)), given the other assumptions, it’ll represent a billion dollars of effort all by itself. And that’s just the kernel, which is only part of a working system. There are other components that weren’t included in More than a Gigabuck (such as OpenOffice.org) that are now common in Linux distributions, which are also large and represent massive investments of effort. More than a Gigabuck noted the massive rise in size and scale of OSS/FS systems, and that distributions were rapidly growing in invested effort; this brief analysis is evidence that the trend continues.
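That break-even size comes from inverting the cost equation: fix the cost at $1 billion and solve for KSLOC. A quick Python check (my own illustration, using the same salary, overhead, and COCOMO parameters as the earlier calculation):

```python
# Solve  cost = salary * overhead * (3 * ksloc**1.12 * eaf) / 12  for ksloc,
# with the cost fixed at $1 billion.
target_cost = 1e9
salary = 56286
overhead = 2.4
eaf = 1.54869      # effort-adjustment factor computed earlier

ksloc = (target_cost / salary / overhead * 12 / 3 / eaf) ** (1 / 1.12)
print(f"Break-even size: {ksloc * 1000:,.0f} physical SLOC")
```

This yields roughly 6.65 million physical SLOC, matching the figure in the text.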

In short, the amount of effort that today’s OSS/FS programs represent is rather amazing. Carl Sagan’s phrase “billions and billions,” which he applied to astronomical objects, easily applies to the effort (measured in U.S. dollars) now invested in OSS/FS programs.

I’d like to thank Ingo Molnar for doing the original analysis (using SLOCCount) that triggered this paper. Indeed, I’m always delighted to see people doing analysis instead of just guesswork. Thanks for doing the analysis! This paper is not in any way an attack on Molnar’s work; Molnar computed a quick estimate, and this paper simply uses more data to refine his effort estimation further.

Also, I’d like to tip my hat to Charles Babcock’s October 19, 2007 article “Linux Will Be Worth $1 Billion In First 100 Days of 2009”. He noticed that, by my calculations, if the Linux kernel ever reached 6.6 million lines of code, it would be worth more than $1 billion in terms of equivalent, commercial development costs. Using the current size and growth rates of the Linux kernel, he examined the trend lines and found that “Sometime during the first 100 days of 2009, Linux will cross the 6.6 million lines of code mark and $1 billion in value.”

In 2010, researchers re-did the analysis, and found that it had crossed this milestone. Jesus Garcia-Garcia and Ma Isabel Alonso de Magdaleno found that the then-latest version (2.6.30) of the Linux kernel would cost an estimated EUR 1,025,553,430 to re-develop; at the exchange rate of 1.3499 U.S. Dollars per Euro of 2010-02-25 (reported by Yahoo finance), this becomes about $1.4 billion.

The Linux kernel keeps growing; as of March 7, 2011, it would cost approximately $3 billion USD to redevelop using this estimation method.

Of course, the real story isn’t the exact numbers, it’s that instead of disappearing, FLOSS programs like the Linux kernel are thriving.