How to Prevent the Next Heartbleed

This paper analyzes the Heartbleed vulnerability (CVE-2014-0160) in OpenSSL, found in 2014. After an introduction and a discussion of why it wasn’t found earlier, this paper focuses on identifying and discussing countermeasures that could have countered Heartbleed-like vulnerabilities. The paper also discusses preconditions, what would reduce the impact of Heartbleed-like vulnerabilities, applying these approaches, and exemplars. It ends with conclusions and recommendations. My hope is that developers will learn to apply at least some of these techniques to prevent or reduce the impact of future problems. This paper is part of my essay suite Learning from Disaster.

Introduction

The Heartbleed vulnerability is a serious security vulnerability formally identified as CVE-2014-0160 [Heartbleed.com] and described in CERT Vulnerability Note VU#720951. Heartbleed is a vulnerability in OpenSSL, a widely-used toolkit that implements the cryptographic protocol Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS). When you use a web browser with an “https://” URL you are using SSL/TLS, and in many cases at least one side (the client or server) uses OpenSSL. The XKCD cartoon Heartbleed Explanation neatly shows how the vulnerability can be exploited [XKCD]; it is remarkably easy to exploit.

The impact of the Heartbleed vulnerability was unusually large. Heartbleed affected a huge number of popular websites, including Google, YouTube, Yahoo!, Pinterest, Blogspot, Instagram, Tumblr, Reddit, Netflix, Stack Overflow, Slate, GitHub, Yelp, Etsy, the U.S. Postal Service (USPS), Blogger, Dropbox, Wikipedia, and the Washington Post. UK parenting site Mumsnet (with 1.5 million registered users) had several user accounts hijacked and its CEO was impersonated. A breach at Community Health Systems (CHS), initially via Heartbleed, led to an information compromise that affected an estimated 4.5 million patients [TrustedSec] [Ragan2014]. One paper stated that “Heartbleed’s severe risks, widespread impact, and costly global cleanup qualify it as a security disaster” [Durumeric2014]. Bruce Schneier put it succinctly: “On the scale of 1 to 10, this is an 11” [Schneier2014].

Google and Codenomicon independently found and reported this vulnerability at close to the same time. Rita Mailheau reports, based on work by Ben Grubb from the Sydney Morning Herald, that Neel Mehta and his team from Google Security discovered Heartbleed on 2014-03-21 during a source code review, and that engineers at Finnish company Codenomicon (Antti Karjalainen, Riku Hietamäki, and Matti Kamunen) separately discovered Heartbleed on 2014-04-02 using a new extension (called Safeguard) in their Defensics fuzz testing tool [Mailheau]. There is strong evidence that no attacker conducted widespread scanning for vulnerable servers before the public revelation of Heartbleed on 2014-04-07, since no Heartbleed-based attacks were found before that date in four large network data taps (though it is possible that targeted attacks occurred before then) [Fisher2014] [Durumeric2014]. The US government has publicly noted that it “had no prior knowledge of the existence of Heartbleed” [WhiteHouse2014].

A key reason for Heartbleed’s large impact was that many widely-used tools and techniques for finding such defects did not find Heartbleed. This paper discusses specific tools and techniques that could have detected or countered Heartbleed, and vulnerabilities like it, ahead-of-time. I will first briefly examine why many common tools and techniques did not find it, since it is important to understand why they failed. I will also briefly cover preconditions, impact reduction, applying these approaches, exemplars, and conclusions. This paper does not describe how to write secure software in general; for that, see my book Secure Programming for Linux and Unix HOWTO [Wheeler2004] or other such works. I think the most important approach for developing secure software is to simplify the code so it is obviously correct, including avoiding common weaknesses, and then limit privileges to reduce potential damage. However, here I will focus on ways to detect vulnerabilities, since even the best developers make mistakes that lead to vulnerabilities. This paper presumes you already understand how to develop software, and is part of the larger essay suite Learning from Disaster.

If you’re in a hurry, you can jump directly to the conclusions.

My goal is to help prevent similar vulnerabilities by helping projects improve how they develop secure software. As the fictional character Mazer Rackham says in Orson Scott Card’s Ender’s Game, “there is no teacher but the enemy... only the enemy shows you where you are weak”. Let’s learn from this vulnerability how we can avoid similar vulnerabilities in the future.

Why wasn’t this vulnerability found earlier?

There are many detailed explanations of why the code was vulnerable, e.g., [Cassidy2014]. However, for our purposes we only need to focus on the broader technical reasons that this vulnerability existed and stayed undetected for so long.

This OpenSSL vulnerability was caused by well-known general weaknesses (a weakness is basically a type of potential vulnerability). The key weakness can be classified as a buffer over-read (CWE-126) in the heap, which could happen because of improper input validation (CWE-20) of a heartbeat request message. CWE-126 is a special case of an “out-of-bounds read” (CWE-125), which itself is a special case of “improper restriction of operations within the bounds of a memory buffer” aka “improper restriction” (CWE-119). These are really well-known weaknesses; many tools specifically look for improper restriction of operations within the bounds of a memory buffer. OpenSSL is routinely examined by many tools, too.
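To make the weakness concrete, here is the essence of the vulnerable code, lightly abridged from OpenSSL’s tls1_process_heartbeat() in t1_lib.c (n2s and s2n are OpenSSL macros that convert a 16-bit length between network byte order and a C integer). The attacker fully controls the claimed payload length, and nothing compares it to the size of the record actually received:

```c
/* Lightly abridged from OpenSSL's vulnerable tls1_process_heartbeat()
 * (t1_lib.c, OpenSSL 1.0.1 through 1.0.1f). p points at the
 * attacker-supplied heartbeat message inside the received TLS record. */
unsigned char *p = &s->s3->rrec.data[0], *pl;
unsigned short hbtype;
unsigned int payload;
unsigned int padding = 16;

hbtype = *p++;       /* 1 byte: message type */
n2s(p, payload);     /* 2 bytes: attacker's CLAIMED payload length */
pl = p;              /* start of the actual payload */

/* MISSING CHECK (essentially what the official fix added):
 *   if (1 + 2 + payload + 16 > s->s3->rrec.length)
 *       return 0;   -- silently discard, per RFC 6520
 */

if (hbtype == TLS1_HB_REQUEST) {
    unsigned char *buffer = OPENSSL_malloc(1 + 2 + payload + padding);
    unsigned char *bp = buffer;

    *bp++ = TLS1_HB_RESPONSE;
    s2n(payload, bp);
    /* Buffer over-read: copies 'payload' bytes (up to 65535) starting
     * at pl, even if the attacker actually sent far fewer; whatever
     * heap memory follows the record is copied into the response. */
    memcpy(bp, pl, payload);
    /* ...pad and transmit the response... */
}
```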

Kupsch and Miller specifically examined the Heartbleed vulnerability and identified several reasons this vulnerability was not found sooner, even though people and tools were specifically looking for vulnerabilities like it [Kupsch2014-May]. They even noted that “Heartbleed created a significant challenge for current software assurance tools, and we do not know of any such tools that were able to discover the Heartbleed vulnerability at the time of announcement.” Here I will emphasize a few of their points and add a few points of my own.

Please note that I am focusing on the technical aspects here. “Of Money, Responsibility, and Pride” by Steve Marquess discusses how the OpenSSL work has been funded in the past (primarily through work-for-hire contracts). They did have some funding, and were even turning away money in a number of cases (see the essay for more). But at the time of Heartbleed there was only one person working on OpenSSL full-time, in spite of the importance of OpenSSL. Since that time, the Core Infrastructure Initiative (CII) has invested money in OpenSSL, and things have gotten better. Obviously money matters, and I'm not discounting that, but many other authors have discussed funding. In this essay, I'm focusing on the technical aspects of Heartbleed (and what we can learn from those aspects).

But first, a few quick comments on terminology:

Overflow. Many people, including MITRE’s CWE, use the term “buffer overflow” to mean only over-writes (writing outside the buffer region), and often use the term even more narrowly to mean only copying information that writes past the end of a buffer (see CWE-120). Some other people use the term “buffer overflow” to mean either a buffer over-read or a buffer over-write, and use the more precise term buffer over-write to specifically mean a write. This matters because the Heartbleed vulnerability allowed improperly reading data, instead of the more common problem of allowing improper writing. I have tried to write this text so that it will be clear no matter which meaning you choose. Heartbleed was an over-read in a buffer stored in the heap.

TOE or SUT. We need some term for the software we are evaluating. One common term is the Target of Evaluation (TOE); this is the term used by the Common Criteria (ISO/IEC 15408). Another term is System Under Test (SUT). The word “test” often implies that you are executing the program, and not all evaluation processes do this, so I will use the term TOE instead for consistency.

Static analysis

Static analysis tools work without executing the program. The most commonly discussed static analysis tools for finding vulnerabilities are variously called source code weakness analyzers, source code security analyzers, static application security testing (SAST) tools, static analysis code scanners, or code weakness analysis tools. A source code weakness analyzer searches for vulnerabilities using various kinds of pattern matching (e.g., it may do taint checking to track data from untrusted sources and determine whether that data reaches potentially-dangerous operations). There are various reports that evaluate these tools, e.g., [Hofer2010].

However, it’s known that many widely-used static analysis tools would not have found this vulnerability ahead-of-time:

Coverity: Coverity would not have found it ahead-of-time. They are currently working to improve their tool so it will find similar vulnerabilities in the future, using some very interesting new heuristics [Chou2014].

HP/Fortify: HP/Fortify has posted several public statements about Heartbleed, but I have not found any claims that their static analysis tool would have found this vulnerability ahead-of-time. They did modify their dynamic suite to test for the vulnerability once it was publicly known, but that is not the same as detecting it ahead-of-time. Their lack of claims, when specifically discussing it, leads me to believe that their tool would not have found Heartbleed ahead-of-time.

Klocwork: Klocwork would not have detected this vulnerability in its normal configuration [Sarkar2014].

Grammatech: Grammatech’s CodeSonar also could not detect this vulnerability. They too are working on experimental improvements that would find vulnerabilities like it in the future (their approach involves a new warning class called Tainted Buffer Access as well as extensions to their taint propagation algorithm) [Anderson2014]. GrammaTech could do the taint analysis starting at socket buffers, but didn’t do it because it was too slow in practice. When they turned it on for the right section of code, it found the problem. I don’t think that counts; if you already know where the problem is, then you don’t need a tool to find it.

The only static analysis tool I've found so far that existed at the time, and was able to find Heartbleed ahead-of-time without a non-standard or specialized configuration, is the FLOSS tool CQual++. CQual++ is a polymorphic whole-program dataflow analysis tool for C++, inspired by Jeff Foster's Cqual tool (indeed, it uses the same backend solver). Although CQual++ focuses on C++, it is also able to analyze C programs (including OpenSSL). CQual++ is the main tool provided by Oink/Elsa (Oink is a collaboration of C++ static analysis tools; Elsa is the front-end for Oink). Daniel S. Wilkerson reported in Oink documentation that "After the Heartbleed bug came out, someone at a government lab that will not let me use their name wrote me (initially on 18 April 2014), saying: Yes, you are interpreting me correctly. CQual++ found Heartbleed while the [proprietary] tools I tried did not." The paper "Large-Scale Analysis of Format String Vulnerabilities in Debian" by Karl Chen and David Wagner suggests that this toolsuite can be effective at detecting vulnerabilities. However, the same reporter may have also made it clear why CQual++ was not used to find the problem first: "I also applied CQual++ to an important internal project and found it very effective (though a bit difficult to run and interpret) at identifying places where sanitization routines weren't being called consistently."

A fundamental issue is that most of these tools do not guarantee to find all vulnerabilities; most do not even guarantee to find vulnerabilities of any particular kind. Sadly, the terminology about this is confusing, so I will first need to clarify the terminology.

In this paper I will call a software analysis tool incomplete if the tool does not necessarily find all vulnerabilities (of a given kind) in the software being analyzed. Previous versions of this paper, and many people, use the term unsound instead to describe tools that look for vulnerabilities (aka bug-finders) that do not claim to find all vulnerabilities. For example, Bessey et al. discuss Coverity’s static analysis tool and say, “like the PREfix product, we were also unsound. Our product did not verify the absence of errors but rather tried to find as many of them as possible... Circa 2000, unsoundness was controversial in the research community, though it has since become almost a de facto tool bias for commercial products and many research projects...” [Bessey2010] This term unsound can cause confusion, because people who develop or use program checkers use the term unsound with a different meaning. One blog post explains why the same term seems to have two conflicting meanings: “most program checkers prove theorems about programs. In particular, most aim to prove programs correct in some respect (e.g. type safety). A theorem prover is sound [if and only if] all the theorems it proves are true... People in the program-checking field are accustomed to this, so they habitually think soundness [means] proving the absence of bugs. But a bug-finder doesn’t aim to prove correctness. Instead, it aims to prove incorrectness: to prove the presence of bugs. It’s sound [if and only if] all the bugs it reports are real bugs - that is, if it has no false positives. False negatives (overlooking bugs) are OK, because they don’t make its claims incorrect.” [ArcaneSentiment2014]

I have adopted the NIST SAMATE SATE V Ockham Sound Analysis Criteria [NIST-Sound] in this paper to eliminate this confusion. In the NIST SAMATE terminology, tools that do not guarantee to find all vulnerabilities (of any particular kind) are termed incomplete. Here’s how NIST differentiates between soundness and completeness: “a site is a location in code where a weakness might occur. A buggy site is one that has an instance of the weakness, that is, there is some input that will cause a violation. A non-buggy site is one that does not have an instance of the weakness, in other words, is safe or not vulnerable... A finding is a definitive report about a site. In other words, that the site has a specific weakness (is buggy) or that the site does not have a specific weakness (is not buggy)... Sound means every finding is correct. [A sound] tool need not produce a finding for every site; that is completeness” [NIST-Sound].

Why are so many source code weakness analyzers incomplete? First, most programming languages are not designed to be easy to analyze, making it more difficult to analyze programs in general. Second, most software is not written to make it easy for static analyzers to analyze it. As a result, complete analysis tools may require a lot of human help to apply to existing programs. In contrast, incomplete analysis tools can be applied immediately to existing programs. They manage this by using heuristics to help them identify likely vulnerabilities and complete their analysis within useful times. However, this comes with a major caveat: incomplete source code weakness analyzers often miss vulnerabilities.

Clearly Heartbleed is one of those cases where these incomplete heuristics led to a failure by many static analysis tools to find an important vulnerability. The fundamental reason they all failed to find the vulnerability is that the OpenSSL code is extremely complex; it includes multiple levels of indirection and other issues that simply exceeded these tools’ abilities to find the vulnerability. Developers should simplify the code (e.g., through refactoring) to make it easier for tools and humans to analyze the program, as I discuss further later. A deeper (though partial) reason is that the programming languages C, C++, and Objective-C are notoriously difficult to statically analyze; constructs like pointers (and especially function pointers) can be difficult to statically track.

This does not mean that static analyzers are useless. Static analyzers can examine how the software will behave under a large number of possible inputs (as compared to dynamic analysis), and the tool heuristics often limit the number of false positives (reports of vulnerabilities that are not vulnerabilities). But - and this is the important point - the heuristics used by incomplete static analysis tools sometimes result in a failure to detect important vulnerabilities.

Dynamic analysis

Dynamic approaches involve running the program with specific inputs and trying to find vulnerabilities.

A limitation of dynamic approaches is that it’s impossible to fully test any program in human-relevant timetables. For example, a trivial program that adds two 64-bit integers has 2^128 possible inputs. Testing all inputs (assuming a 4GHz processor and 5 cycles to test each input, i.e., 8 × 10^8 tests per second) would require 13.5 sextillion years: 2^128 / (8 × 10^8) ≈ 4.3 × 10^29 seconds, or about 1.35 × 10^22 years. Even massively-parallel computing does not really help. Real programs, of course, have far more complex inputs than this! Thus, dynamic approaches cannot show that a program is secure in a strong sense; all they can show is the absence of vulnerabilities with the tests that were used.

But this does not mean that dynamic approaches are useless. Dynamic approaches can be a very useful way to improve security, as long as their limitations are understood. Of course, dynamic approaches (aka software testing) are a useful way to find defects. A general introduction to software testing not specific to security is available in Introduction to Software Testing by Paul Ammann and Jeff Offutt [Ammann2008].

Let me discuss two areas that are widely used, but would fail to find Heartbleed: a mostly-positive test suite and traditionally-applied fuzzers.

Mostly-positive automated test suites

One approach is to create a big automated test suite. Eric S. Raymond and some others have been discussing Heartbleed, and in our discussion he stated that, “I think a lot of people have an intuition that test suites don’t work very well... What I’ve learned since is that the gap is relatively narrow - pushing conventional methods hard enough can get you pretty close to never-break”. I completely agree with him that a good automated regression test suite is powerful, especially for non-security defects. If you don’t have one, create one, full stop, we agree.

However, whether or not a test suite would have found the Heartbleed vulnerability depends on how you create that test suite. The way many developers create test suites produces what I call “mostly-positive” test suites, which would probably not have found Heartbleed. I will later discuss negative testing, a testing approach that would have worked, but we first need to understand why common testing approaches fail.

Many developers and organizations almost exclusively create tests for what should happen with correct input. This makes sense, when you think about it; normal users will complain if a program doesn’t produce the correct output when given correct input, and most users do not probe what the program does with incorrect input. If your sole goal is to quickly identify problems that users would complain about in everyday use, mostly-positive testing works. Besides, many software developers have a bias to focus on making the program work with correct input, and at most try to handle some error conditions they can easily foresee, so they have a natural tendency to create tests with correct input. Many developers simply don’t think about what happens when an attacker sends input that is carefully crafted to exploit a program.

I will call the approach of primarily creating tests for what should happen with correct input a mostly-positive test suite. Unfortunately, in many cases today’s software regression test suites are mostly-positive. Two widely-practiced test approaches typically focus on creating mostly-positive test suites:

Test-driven development (TDD) is a software development process in which the developer “writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test, and finally refactors the new code to acceptable standards” [Wikipedia-TDD]. In nearly all cases the TDD literature emphasizes creating tests that describe what a modified function should do, not what it should not do, and many TDD materials do not even mention creating negative tests. A developer could create negative tests while implementing TDD, but in practice this is unusual among those using TDD.

Interoperability testing is a system testing process where different implementations of a standard are connected together to determine if they can connect and interoperate (by exchanging data). Interoperability testing is great for helping developers correctly implement a standard protocol (such as SSL/TLS). However, the other implementations are also trying to comply with the specification, so the other implementation usually will not test for “what should not happen”.

Mostly-positive testing is practically useless for secure software. Mostly-positive testing generally isn’t testing for the right thing! In the Heartbleed attack, like most attacks, the attacker sends data in a form not sent in normal use. TDD and interoperability testing are good things... but you typically need to augment them if your goal is secure software.

Code coverage tools as typically used would not have helped either. Some developers may, in addition, run code coverage tools to see what wasn’t tested, and then add additional tests so that a larger percentage of the code is covered by tests. Code coverage tools are actually hybrids of static and dynamic analysis, but for purposes of this paper we will discuss them here. The key questions with code coverage are (1) what specific code coverage measurement(s) are used, and (2) what are their minimum value(s)? These vary, but in practice many people are happy with a test suite that only tests 80%-90% of the code when measured as statements or branches (aka decisions). A very few might press all the way up to 100% coverage of statements or branches. Some, particularly those in the safety community, may use a slightly more rigorous coverage measure such as modified condition/decision coverage (MC/DC). Other coverage measures are possible, but these are the common ones. Test coverage tools for these common measures have some security value; for example, they can sometimes detect malicious software that is waiting for a trigger (since tests will often not include the trigger). They can also check if exception handlers seem to run correctly. But even 100% coverage, as measured by typical code coverage tools, would not have been enough to counter Heartbleed. The Heartbleed vulnerability involved the absence of proper input validation. Fundamentally, a code coverage tool as typically used cannot notice missing code; it can only notice existing statements or branches that are untested.

It is not clear that a less-common approach, program mutation testing, would have worked either. Mutation testing is a different coverage measure and way to develop new tests. As applied to programs, in mutation testing you “mutate” a source program, using a set of mutation operators, to create “mutants”. If a test can detect a difference between a mutant and the original, then the mutant is “killed”. This would not have detected the lack of input validation, because there was no input validation test to mutate. It is conceivable that it might have found the buffer over-read, and there has been some research work on using program mutation testing to detect vulnerabilities (e.g., [Shahriar2008]). You can also mutate input data structures and use those mutated inputs as tests. Mutating input data structures certainly could find the Heartbleed vulnerability. Two approaches I discuss later (thorough negative testing and fuzzing with address checking and standard memory allocator) can be viewed as ways to apply mutation testing to input data structures.

I should note that this is not unique to OpenSSL. CVE-2014-1266, aka the goto fail error in the Apple iOS implementation of SSL/TLS, demonstrated that its testing was also mostly-positive. In this vulnerability, the SSL/TLS library accepted valid certificates (which were tested). However, no one had tested to ensure that the library rejected certain kinds of invalid certificates. If you only check if valid data produces valid results, you are unlikely to find security vulnerabilities, since most attacks are based on invalid or unexpected inputs.
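For reference, here is the core of that Apple defect, abridged from the published SSLVerifySignedServerKeyExchange() code: the accidentally duplicated goto always executes (with err still 0), skipping the final signature check. A single negative test with a bad signature would have caught it:

```c
/* Abridged from Apple's published SSLVerifySignedServerKeyExchange()
 * (Security-55471), the source of CVE-2014-1266. */
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;   /* duplicated line: always taken, with err == 0 */
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;
/* ...the actual signature verification below is never reached... */
fail:
    SSLFreeBuffer(&signedHashes);
    SSLFreeBuffer(&hashCtx);
    return err;  /* err is 0, so the bad certificate "verifies" */
```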

You can find this vulnerability (and similar ones) if you create the test suite using a different approach (negative testing) that I describe below. But first, let’s discuss fuzzing.

Traditionally-applied fuzzers and fuzz testing

Fuzz testing is the process of generating pseudo-random inputs and then sending them to the program-under-test to see if something undesirable happens. The tools used to implement fuzz testing are called fuzzers.

Note that fuzz testing is different from traditional testing; in traditional testing, you have a given set of inputs, and you know what the expected output should be for each input. Traditional testing can be expensive as the number of tests grows, because you have to figure out the expected output. The mechanism that determines the expected output is called an oracle. The costly problem of invoking an oracle for a large number of test inputs is sometimes called the oracle problem.

Fuzz testing approaches the oracle problem differently, because it only tries to detect “something bad” like a program crash. This makes it easy to try many more test inputs, by making the output checking much less precise. The fuzzing approach was originally developed by Barton Miller in 1988 at the University of Wisconsin. The “fuzz testing of application reliability” site at http://pages.cs.wisc.edu/~bart/fuzz/ has more information about fuzzing in general. For more about fuzzing, see [Takanen2008] and [Sutton2007].

Fuzzers are often used to help find security vulnerabilities, because they can test a huge number of unexpected inputs. In particular, fuzzers are often useful for finding input validation errors, and Heartbleed was fundamentally an input validation error. Yet typical fuzzers completely failed to find the Heartbleed vulnerability!

Fundamentally, the way fuzzers are typically applied would not have found Heartbleed. Heartbleed was a buffer over-read vulnerability, not a buffer over-write vulnerability. Most fuzzers just send lots of data and look for program crashes. However, while buffer over-writes can often lead to crashes, buffer over-reads typically do not crash in normal environments (my thanks to Mark Cornwell who pointed this out).

Several mechanisms are sometimes used to improve the likelihood of detecting or countering buffer over-writes. But again, Heartbleed involved an over-read, not an over-write, so some of these additional mechanisms would not help at all. For example, canary-based protection approaches (e.g., ProPolice) and non-executable stacks are designed to counter over-writes - not over-reads. GNU libc’s malloc() has the option MALLOC_CHECK_; this uses a less-efficient implementation that tolerates simple memory allocation errors (such as double-free) and tries to detect corruption (e.g., caused by writing past the end of an allocated block). The MALLOC_CHECK_ option is a helpful countermeasure against over-writes, but I have no evidence that it would have detected or countered an over-read like Heartbleed. Similarly, Dmalloc’s fence-post (bounds) checking “cannot notice when the program reads from these areas, only when it writes values.”

Fuzzers can find vulnerabilities like Heartbleed. However, to make that happen, we need to extend the error-detection capabilities that they use (beyond the simple approaches widely used today). One way to extend their error-detection capabilities is to use special address-checking tools that can detect memory problems like over-reads during fuzzing. These special address-checking tools (such as address sanitizer or a guard page system) turn subtle problems into something the fuzzer can detect, such as a crash. These special tools typically require that the program allocate and deallocate memory normally.

It is known that OpenSSL does not allocate and deallocate memory directly using standard calls. Instead, it uses a caching freelist system internal to OpenSSL to reuse allocated memory. Since OpenSSL does not return (deallocate) memory back to the underlying system once it is done with it (in some cases), special tools could fail to detect some common weaknesses, such as use-after-free or double-free, that they would otherwise find. It also sometimes prevented some operating system and run-time mitigation mechanisms from working. In short, it is widely agreed that this OpenSSL memory allocation approach prevented many mitigation and weakness detection mechanisms from working.
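To illustrate why such a scheme blinds memory-checking tools, here is a minimal sketch of a caching freelist allocator (my own simplified illustration with a single bucket, not OpenSSL’s actual code). Because “freed” buffers never reach free(), and reused buffers never pass through malloc(), tools that instrument those calls cannot mark the memory as invalid between uses:

```c
/* A minimal sketch (single bucket; NOT OpenSSL's actual code) of a
 * caching freelist allocator. "Freed" buffers go onto an internal
 * list and are handed out again later, so free() and malloc() are
 * bypassed on the hot path, and tools that watch those calls (ASan,
 * valgrind, guard-page mallocs) see the memory as always allocated. */
#include <stdlib.h>

struct freelist_item { struct freelist_item *next; };
static struct freelist_item *freelist = NULL;

void *cached_malloc(size_t size) {
    if (size < sizeof(struct freelist_item))
        size = sizeof(struct freelist_item);
    if (freelist != NULL) {            /* reuse a cached buffer */
        void *buf = freelist;
        freelist = freelist->next;
        return buf;                    /* still "allocated" to the tools */
    }
    return malloc(size);
}

void cached_free(void *buf) {
    /* Never returned to the system, so a later over-read or
     * use-after-free of this buffer looks like a valid access. */
    struct freelist_item *item = buf;
    item->next = freelist;
    freelist = item;
}
```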

There have been conflicting reports on whether or not these special tools could have specifically found Heartbleed without code changes in OpenSSL. Kupsch and Miller reported in the April 22, 2014 edition of their paper that OpenSSL uses a custom memory allocator, and that “to a dynamic analysis tool, it appears as if the library is allocating large memory buffers and not returning them, but in reality, it is subdividing these large blocks of memory and returning them for use” [Kupsch2014-April]. This kind of subdivision completely defeats the ability of these special tools to detect over-reads like the one in Heartbleed. Based on this information, older versions of this paper reported that fuzzers would have been unlikely to have found Heartbleed in unmodified OpenSSL, even if some of these special tools were used. However, Chris Rohlf and I have since independently investigated the OpenSSL code. On careful examination it appears that while OpenSSL does have a custom system, this memory subdivision does not occur in OpenSSL. The fact that it is so difficult to even determine what the allocator does is testimony that the memory allocation system itself is too complex! Based on this more recent information, it appears that fuzzing could have found Heartbleed, but only if special tools were used with fuzzers, and Kupsch and Miller have updated their paper [Kupsch2014-May]. This is, however, a minor point. Kupsch and Miller were and are correct that typical fuzz testing would not have found this vulnerability, and that the OpenSSL code countered many mitigation and defect-detection tools.

There has been some speculation that fuzzing hasn’t been done as rigorously for OpenSSL and other cryptographic libraries because encryption greatly reduces effectiveness of fuzz testing unless the fuzzer is given keys and is specially written to attack the library [Uberti]. That might be true. However, nothing prevents anyone from writing fuzzers that are given keys (for purposes of testing). Besides, the Heartbleed vulnerability can be found even without keys. Thus, it’s really the fact that it was an over-read that made traditional fuzzing ineffective.

Some fuzz testing systems are white box fuzz testing systems. These systems typically use static analysis to determine what parts of the program are not tested by earlier fuzz testing, and then develop new inputs to test those previously-untested portions. SAGE (Scalable, Automated, Guided Execution) is an example of a tool that takes this approach [Godefroid2008] [Bounimova2013].

To summarize: Traditionally-applied fuzzers and fuzz testing could not find Heartbleed. As I will soon describe, fuzzing can be effective if special address checking tools and a standard memory allocator are used.

What would counter Heartbleed-like vulnerabilities?

Here is a partial list of tools and techniques that would have countered Heartbleed ahead-of-time (either with certainty or with very high confidence). I will specifically note some free / libre / open source software (FLOSS) where that makes sense to do so.

But first, some caveats:

Do not use just one of these tools and techniques to develop secure software. Developing secure software requires a collection of approaches, starting with knowing how to develop secure software in the first place. Most organizations who want to create secure software at least try to write software in a simple and clear way, enable and heed compiler warning flags, apply source code weakness analyzers, apply multi-person review, run fuzzers, and apply a large automated regression test suite. If you only use one technique, you run the risk of fighting the “last war” instead of the current one. For example, it would be absurd to ignore warning flags, even though warning flags would not have detected Heartbleed. (Yes, sometimes warning flags produce false positives, but in most cases you should modify your code to eliminate the false positives.) That said, when an attack succeeds, it’s important to see how to improve things; otherwise attackers may keep breaking into the software using that same approach. Also, the more general an improvement is, the more likely that same improvement would also counter many other attacks.

There is no one master list of the types of tools and techniques that exist. Terminology varies, and different tools do different things. I am co-author of a report that lists various tools and techniques for software assurance [Wheeler2014a], which gives the most complete list I am aware of. For additional information about types of tools and techniques, see [BAH2009] [NIST]. I’ve created this list for this specific paper. However, I will try to be clear about what I mean.

This is certainly not a complete list of ways that Heartbleed could have been detected ahead-of-time, as I mentioned above. I do hope it helps; further suggestions would be welcome!

So given those caveats, what specifically could have countered this vulnerability ahead-of-time? To make this especially useful, I have roughly ordered them in cost order, with the cheapest approaches listed first. It is a really rough order, and some are especially debatable; suggestions on how to improve it are welcome. In many cases the more expensive approaches are more general and can counter many other kinds of vulnerabilities, not just Heartbleed. The subheadings identify in parentheses which use dynamic analysis, static analysis, or a hybrid. The final subsection discusses other approaches that might have worked.

Thorough negative testing in test cases (dynamic analysis)

Negative testing is creating tests that should cause failures (e.g., rejections) instead of successes. For example, a system with a password login screen will typically have many positive regression tests to show that logins succeed if the system is given a valid username and credential (e.g., password). Negative testing would create many tests to show that invalid usernames, invalid passwords, and other invalid inputs will prevent a login. One book defines negative testing as “unexpected or semi-valid inputs or sequences of inputs... instead of the proper data expected by the... code” [Takanen2008 page 24]. There are many ways to do negative testing, including creating specific tests (the focus of this section) and creating semi-random test input (covered in a later section as fuzzing).

Thorough negative testing in test cases creates a set of tests that cover every type of input that should fail. I say every type of input, because you cannot test every input, as explained in the section on dynamic analysis. You should include invalid values in your regression test suite to test each input field (in number fields at least try smaller, larger, zero, and negative), each state/protocol transition, each specification rule (what happens when this rule is not obeyed?), and so on. This would have immediately found Heartbleed, since Heartbleed involved a data length value that was not correct according to the specification. It would also find other problems like CVE-2014-1266, the goto fail error in the Apple iOS implementation of SSL/TLS. In CVE-2014-1266, the problem was that iOS accepted invalid certificates. There were many tests with valid certificates... but clearly not enough tests to check what happened with invalid ones.
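As an illustration, here is a sketch of one such negative test for the heartbeat message, written against a hypothetical test harness (tls_connect, tls_send_record, and tls_recv_record are invented helper names, not a real API). Per RFC 6520, a heartbeat whose claimed payload length exceeds the bytes actually received must be silently discarded, so any response at all is a failure:

```c
/* Sketch of a negative test for the TLS heartbeat extension.
 * tls_connect(), tls_send_record(), and tls_recv_record() are
 * hypothetical test-harness helpers, invented for this sketch. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

extern void  *tls_connect(const char *host, int port);
extern void   tls_send_record(void *conn, int content_type,
                              const uint8_t *data, size_t len);
extern size_t tls_recv_record(void *conn, uint8_t *buf, size_t buflen);

void test_oversized_heartbeat_is_discarded(void) {
    uint8_t msg[] = {
        0x01,        /* HeartbeatMessageType: heartbeat_request */
        0xFF, 0xFF,  /* claimed payload_length: 65535 bytes...   */
        0x42         /* ...but only 1 byte of payload follows    */
    };
    void *conn = tls_connect("localhost", 4433);
    tls_send_record(conn, 24 /* heartbeat content type */, msg, sizeof msg);

    /* A correct implementation sends nothing back; any response at
     * all (let alone ~64 KiB of heap contents) is a test failure. */
    uint8_t reply[1];
    assert(tls_recv_record(conn, reply, sizeof reply) == 0);
}
```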

In most cases only negative tests, not positive tests, have any value for security. As I noted earlier, what matters about test suites is how you create them. This is probably obvious to many readers of this paper. In particular, I suspect Eric S. Raymond is including these kinds of tests when he discusses the advantages of testing. However, this is not obvious to many software developers. All too many developers and organizations only use a mostly-positive test suite instead. Many developers find it very difficult to think like an attacker, and simply fail to consider widespread testing of inputs that “should not happen”.

One great thing about thorough negative testing is that it can at least be partially automated. You can create tools that take machine-processable specifications and generate lots of tests that intentionally violate each rule... and then see if the implementation handles them correctly.

Another great thing about thorough negative testing is that if there’s a standard (which there is in this case), it’s possible to collaboratively develop a separate common test suite as a FLOSS project. Then it’s possible to quickly test all current and future implementations and prevent many problems from getting out to users. I would strongly encourage creating general-purpose test suites for protocols like SSL/TLS; that would reduce effort (people only need to create the test suite once), and it would help increase the security for all implementations (not just one). Individual implementations would still need to supplement the general tests with additional tests, but a common big test suite would be a big help.

Software testing is, in fact, an entire field. There are many different kinds of test approaches and test coverage criteria. I can only summarize testing in this paper. For more general information, again, see Introduction to Software Testing by Paul Ammann and Jeff Offutt [Ammann2008]. But the point still stands: testing with only valid input will fail to find many security-related problems, including Heartbleed.

I do not think that you should depend solely on thorough negative testing, or any other single technique, for security. Negative testing, in particular, will only find a relatively narrow range of vulnerabilities, such as especially poor input validation. Dynamic approaches, by their very nature, can only test an insignificant portion of the true input space anyway. But - and this is key - this approach can be very useful for finding security vulnerabilities before users have to deal with them.

Fuzzing with address checking and standard memory allocator (dynamic analysis)

Unfortunately traditional fuzz testing approaches were not helpful in this case. But there are simple lessons we can learn. Fuzzing would have been much more effective if a special tool called an address accessibility checker had also been used. These kinds of special tools can detect many out-of-bound reads in addition to out-of-bound writes during execution, and can often detect other memory problems as well. They are especially good at detecting when a read or write incrementally goes beyond the end of the buffer, and that is exactly the problem with Heartbleed.

There are a number of special tools that perform some sort of address accessibility checking; every tool has its pros and cons. However, if you haven’t used anything else, I strongly recommend that you check out address sanitizer (ASan).

Address sanitizer (ASan)

Address sanitizer (ASan) was first released in 2012, and is now easily available; it’s just an extra flag (-fsanitize=address) built into the LLVM/clang and gcc compilers. Address sanitizer is nothing short of amazing; it does an excellent job at detecting nearly all buffer over-reads and over-writes (for global, stack, or heap values), use-after-free, and double-free. It can also detect use-after-return and memory leaks. It cannot find all memory problems (in particular, it cannot detect read-before-write), but that’s a pretty good list. Its performance overhead averages 73%, with a 2x-4x memory overhead. This performance overhead is usually fine for a test environment, and it’s remarkably small given how good it is at detecting these problems. Many other memory-detection mechanisms have a far larger speed and memory use penalty, and many guard page tools (described below) can only detect heap-based problems. The one big drawback with ASan is that in current implementations you have to recompile the software to use it; in many cases that is not a problem.
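For example, here is a minimal program exhibiting the same weakness class as Heartbleed (a heap buffer over-read, CWE-126). Compiled normally it usually runs to completion silently; compiled with -fsanitize=address, ASan halts it with a heap-buffer-overflow READ report:

```c
/* overread.c: a minimal heap buffer over-read, the same weakness
 * class as Heartbleed (CWE-126). Build and run, assuming gcc 4.8+
 * or a recent clang:
 *   gcc -g -fsanitize=address overread.c -o overread && ./overread
 * ASan aborts with a "heap-buffer-overflow ... READ" report; built
 * WITHOUT -fsanitize=address it usually runs to completion silently,
 * which is exactly why traditional fuzzing missed Heartbleed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *payload = malloc(16);     /* the data actually sent...     */
    memset(payload, 'A', 16);
    char response[64];
    memcpy(response, payload, 64);  /* ...but 64 bytes are "echoed", */
                                    /* reading 48 bytes past the end */
    printf("%.16s\n", response);
    free(payload);
    return 0;
}
```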

For more about ASan, see the USENIX 2012 paper [Serebryany2012] or the ASan website (http://code.google.com/p/address-sanitizer/). The test processes for both the Chromium and Firefox web browsers already include ASan.

Christopher T. Celi (of NIST) confirmed to me on 2014-07-10 that address sanitizer does detect Heartbleed if an attacking query is made against a vulnerable OpenSSL implementation. He ran OpenSSL version 1.0.1e (released in February 2013), which is known to be vulnerable to Heartbleed. He used gcc (version 4.8+) and its -fsanitize=address flag to invoke address sanitizer. As expected, a normal heartbeat request causes no trouble, but a malicious heartbeat request is detected by ASan, and ASan then immediately causes a crash with a memory trace. In his test suite ASan reported, in its error trace, that there was an error when attempting a “READ of size 65535”. He comments that, “Though the output is a bit more cryptic than that of Valgrind, ASan is better for testing with a fuzzer as it crashes upon finding an error. Because of the output however, one would have to analyze the specific input that caused the crash a bit more heavily than with Valgrind.” As I note later, he also confirmed that Valgrind works.

Even more importantly, Hanno Boeck confirmed in 2015 that the fuzzing tool american fuzzy lop (afl), when combined with Address Sanitizer, does automatically find Heartbleed. It only took 6 hours on non-fancy hardware, and that is a short time for a fuzzer. To be fair, at the time afl was barely known, harder to use, and had trouble working with Address Sanitizer. He noted that, "A lot of other things have been improved in afl, so at the time Heartbleed was found american fuzzy lop probably wasn't in a state that would've allowed to find it in an easy, straightforward way." [Boeck2015] However, afl has become a remarkably powerful yet easy-to-use fuzzer. It tracks which branches are taken and how often, then prefers tests that cover the program differently when it evolves new tests. This is so successful that afl has pulled JPEGs out of thin air [lcamtuf2014]. This suggests that using afl, combined with Address Sanitizer or something similar, is worth considering today.

Other address access detection tools (such as guard pages)

There are other tools that can detect memory access and allocation problems. These include binary simulators (e.g., valgrind), guard page systems (e.g., electric fence), and CPU-specific bounds-checking mechanisms (such as Intel Memory Protection Extensions (MPX)). You can use several different tools (possibly on different fuzzer runs). For fuzzing to detect Heartbleed and vulnerabilities like it, the mechanism must be able to detect an over-read (not just an over-write) and eventually lead to a crash or other problem detectable by fuzzing. Some of these approaches have very significant performance overheads; where significant, these overheads can reduce the amount of fuzz testing that can be done in a fixed amount of time.

Binary simulators (such as valgrind and Dr. Memory) indirectly execute a program, while performing additional functions such as tracking memory accesses. A widely-used and widely-respected tool in this category is valgrind; valgrind’s memcheck plug-in can detect a variety of errors, including over-reads on the heap. Valgrind works by creating a “synthetic processor” and monitoring execution. There are various plug-ins for valgrind; the memcheck tool tracks if memory is valid (if it has been initialized) and if it can be accessed (e.g., if it has been allocated or not). Valgrind can be used on programs when you do not have the source code. However, valgrind greatly slows down the program, often 25-50 times, and often increases code size by a factor of 12 [Takanen2008, page 182], but this may be fine for testing. Valgrind’s memcheck is powerful for detecting heap-based vulnerabilities like Heartbleed, but it has an important limitation: memcheck cannot do bounds checking on global or stack arrays. A tool that works in a similar way to valgrind, but focuses especially on memory access issues, is Dr. Memory. Both valgrind and Dr. Memory are FLOSS. These are very useful tools for finding memory-related errors, especially if you lack source code. However, ASan tends to be better if you have the source code and you want to do dynamic bounds-checking; ASan is much faster, takes less memory, and can do bounds-checking for heap, stack, and global data.

Christopher T. Celi (of NIST) confirmed to me on 2014-07-07 that Valgrind does detect Heartbleed if an attacking query is made against a vulnerable OpenSSL implementation. He ran OpenSSL version 1.0.1e (released in February 2013), which is known to be vulnerable to Heartbleed. In this configuration Valgrind detected an “invalid read” of a region that had been allocated by malloc. The invalid read occurred, as expected, inside the standard C function memcpy, which was called by tls1_process_heartbeat (which is responsible for receiving a heartbeat and processing a response), which was called by ssl3_read_bytes. Valgrind could also report that the memory was allocated by the standard C function malloc through OpenSSL’s CRYPTO_malloc, again as expected. In this particular test he sent a message that was known to trigger the Heartbleed attack. He notes that to have detected this ahead-of-time with Valgrind, “Someone testing the code would likely have to use a fuzzer to assemble the proper bytes of hex to send to the server.” Note that he also confirmed that ASan works.

Many tools use guard pages to detect reads or writes that march over or under a buffer. In these systems, a guard page is added after and/or before the allocated memory; attempts to access the guard page region are trapped and specially responded to (e.g., they may lead to a crash). Often these tools are implemented by intercepting a few heap memory allocation calls (such as malloc). Tools that intercept heap allocations and add guard pages typically do not require source code, which is an advantage, but they can only detect heap-based problems. Also, many guard page systems have a significant performance overhead in both speed and memory use. For example, Guard Malloc is reported to increase execution time by a factor of 100 or more [Takanen2008, page 181], in addition to a very large memory overhead. These tools primarily focus on detecting access of unallocated memory (including use after free), but they can sometimes detect use before initialization by filling un-initialized memory with unusual values. Examples of such tools include electric fence, Detect Unintended Memory Access (DUMA) (a fork of electric fence), guard malloc, and the OpenBSD malloc; all of these are FLOSS.
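The underlying mechanism is simple enough to sketch. Here is a minimal guard-page allocator in the style of electric fence (my own illustrative sketch, not any tool’s actual code): the allocation is placed so that it ends right before a PROT_NONE page, so reading even one byte past the end faults immediately:

```c
/* A minimal guard-page allocator sketch (illustrative only, not any
 * tool's actual code). The returned buffer ends immediately before a
 * PROT_NONE page, so reading or writing even one byte past the end
 * raises SIGSEGV, which fuzzing detects as a crash. */
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

void *guarded_malloc(size_t size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;

    /* Map the data pages plus one extra page for the guard. */
    uint8_t *base = mmap(NULL, (data_pages + 1) * page,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Make the final page inaccessible: the guard. */
    if (mprotect(base + data_pages * page, page, PROT_NONE) != 0)
        return NULL;

    /* Place the allocation so it ends exactly at the guard page. */
    return base + data_pages * page - size;
}
```

Note the trade-offs mentioned elsewhere in this paper: each allocation consumes at least two pages, and real tools must also round sizes up to respect alignment, which is why a few trailing bytes can still leak in practice.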

Some system memory allocators, such as the OpenBSD malloc, have a built-in guard page mechanism. These would have inhibited or stopped Heartbleed, depending on how it is implemented. In particular, OpenBSD’s malloc implementation supports guard pages. In OpenBSD, the “G” option causes “each page size or larger allocation is followed by a guard page that will cause a segmentation fault upon any access.” This can be combined with the “P” option (the default), which moves allocations within a page (“allocations larger than half a page but smaller than a page are aligned to the end of a page to catch buffer overruns in more cases.”) The OpenBSD mechanism can be enabled for a particular program or even enabled by default across the whole system, and this can protect many situations. The OpenBSD malloc approach is reported to have relatively moderate overhead and yet “caught serious bugs in lots of major software” [OpenBSD-Journal] [Felker2014].

There is a weakness in the OpenBSD malloc mechanism: Even with both G and P enabled, small allocations (half a page or less) are not immediately followed by a guard page. I think it would be even better if the OpenBSD guard page mechanism could insert a guard page immediately after even relatively small allocations, even though this would probably have a serious speed and memory size impact. But even as it is, enabling both G and P means that all allocations larger than half a page are immediately followed by a guard page (subject to alignment limits), and that allocations that are a half a page or less will at most leak half a page. That can be very significant reduction in leak size compared to the 64KiB of the original Heartbleed attack, depending on the page size (often 4KiB).

Memory allocations must be aligned, so guard pages may leak a few bytes at the end, depending on the implementation. I suspect ASan would be faster than adding a guard page on every allocation, but adding guard pages does not require recompiling most programs, so there is an advantage to having them available. Unfortunately, the popular GNU libc malloc does not include this kind of functionality at all.

Intel Memory Protection Extensions (Intel MPX) or other CPU-specific bounds checking mechanisms might help. MPX adds new registers called bound registers to hold bounds for pointers, and new instructions to manage and use the bounds. MPX is to be released as part of the Skylake architecture, but as of 2014 these CPUs are not available to the public. It will take longer for them to be widely available, and that does not necessarily help non-Intel systems (e.g., smartphones do not usually use an Intel chip).

There are other tools and approaches. The point is that many tools can detect memory over-reads, and using at least one of these tools can make fuzz testing more effective.

Fuzz testing in practice

In general, when using fuzz testing you should turn on as many anomaly detectors as you can. The only detection mechanism used for the first fuzzer was “did the unchanged program crash/hang?” - and many fuzzers still only do that. You should at least enable program assertion checks and create as many assertions as you reasonably can. You might also do additional checking to ensure that the intermediate or final state is valid (for example, sanity-check outputs and examine what files are produced in what directories). But for the purposes of Heartbleed-like vulnerabilities, you should at least turn on invalid memory access detectors like ASan.
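As a small example of “create as many assertions as you reasonably can”, an invariant check like the following (an illustrative sketch; the function and parameter names are invented) turns a subtle protocol violation into an immediate, fuzzer-visible crash, provided assertions are not compiled out with NDEBUG in the fuzzing build:

```c
/* Illustrative sketch: an internal invariant written as an assertion.
 * The function and parameter names are invented for this example.
 * With assertions enabled (no -DNDEBUG) in the fuzzing build, a
 * malformed length aborts the process immediately, giving the fuzzer
 * a loud, attributable failure instead of a silent over-read. */
#include <assert.h>
#include <stddef.h>

void process_heartbeat(const unsigned char *record, size_t record_len,
                       size_t claimed_payload_len) {
    /* Invariant from RFC 6520: type (1) + length (2) + payload +
     * minimum padding (16) must fit inside the received record. */
    assert(1 + 2 + claimed_payload_len + 16 <= record_len);
    (void)record; /* ...normal processing would follow... */
}
```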

Many of these tools, including ASan and guard page based programs, require that the program under test allocate and deallocate memory normally. In particular, the program must not combine multiple allocations into one allocation request (e.g., as is done by a slab allocator or memory slicing implementation). At the least, the program should make it trivial to use a normal allocation approach instead for use in fuzz testing (and test that it works).

It is true that encryption libraries can create special issues for fuzzers [Uberti]. But these issues are easily addressed. As Paul Black has stated to me separately, “a tool based on mutated messages should mutate all parts of the message at all levels: individual bits, before encryption, after encryption, session creation, the whole handshake, [etc.]”. Or as Apostol Vassilev has stated to me separately, “a thorough fuzzer should exercise forbidden state machine transitions”.

The first fuzzers generated truly random data to be sent to a program. However, other methods for creating data can improve fuzzing effectiveness. Most fuzzers can be divided into three categories:

Fully random fuzzers send truly random data to a program. Truly random fuzzers are the easiest to get started with, but they typically only test a small portion of a program that has complex input structures. Examples include fuzz.

Mutation-based aka dumb fuzzers start with sample input data and then modify those samples, often through simple transforms like bit-flipping (a minimal sketch appears after this list). Mutation-based fuzzers are the next fastest and cheapest to get started with; since they start with valid inputs, they can often test more of a program than a fully random fuzzer. Examples of mutation-based fuzzers include General Purpose Fuzzer (GPF), The Art of Fuzzing (Taof), and ProxyFuzz.

Generation-based aka smart fuzzers are provided with detailed information about how to generate the specific protocol being tested. Generation-based fuzzers tend to be even more thorough, but they require more effort (since you have to create a definition of what needs to be generated). Examples of generation-based fuzzers include SPIKE, Sulley, and the Codenomicon Defensics fuzzer.

There are many other ways to categorize fuzzers, too. Template-based fuzzers use existing traces and fuzz parts of the recorded data. Block-based fuzzers break individual protocol messages down into static and variable parts and fuzz only the variable parts. Dynamic generation/evolution-based fuzzers learn the protocol of the Target of Evaluation (TOE) by feeding the TOE with data and interpreting its responses, e.g., using evolutionary algorithms. Model-based fuzzers employ a model of the protocol; the model is executed online or offline to generate complex interactions with the TOE. This enables fuzzing data after a point such as authentication, an important issue for SSL/TLS.

Fuzzing can also be combined with traditional tests, again so that the fuzzing can go beyond a point like authentication. For more discussion about these fuzzer variations and the use of model-based fuzzers, see [Schieferdecker2012]. Codenomicon also discusses different fuzzing approaches and their coverage. Other relevant papers include “A Model-based Approach to Security Flaw Detection of Network Protocol Implementations” [Hsu2008]. Whitebox-based fuzzers examine the program to improve what to fuzz (and thus are really hybrid analysis approaches); these can extend effectiveness, but they require more effort to implement and simply are not necessary to find vulnerabilities like Heartbleed. For more information on fuzz testing, see [Takanen2008] and [Sutton2007].

A lot of work has been going on to improve the coverage of code in fuzz testing. In general, the more code is covered by fuzz testing (measured as statements or branches), the more likely it is that fuzz testing will detect a vulnerability if one is present. Thus, some fuzzers use information about the program being executed to improve fuzzing capabilities. Some tools, such as the FLOSS American fuzzy lop, instrument code to improve code coverage while fuzzing. Microsoft has had good experience with constraint-based whitebox fuzz testing, in which they leverage symbolic execution on binary traces and constraint solving to construct new inputs to a program [Bounimova2013]. However, simply sending the data is not enough; a fuzzer would also have to detect that there was a problem, and out-of-bounds reads typically do not cause a crash or other easily-detected problem unless something else has been done.
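For example, a file-based harness for a coverage-instrumenting fuzzer like American fuzzy lop can be very small; building it with ASan turns silent over-reads into detectable crashes. In this sketch, parse_heartbeat() is a hypothetical stand-in for the code under test (replace the stub with the real target), and the exact afl commands may vary by version:

    /* Typical use (check your afl version's documentation):
     *   AFL_USE_ASAN=1 afl-gcc -g -o harness harness.c
     *   afl-fuzz -i seeds/ -o findings/ ./harness @@
     * (ASan targets typically need a raised memory limit via afl-fuzz's -m.) */
    #include <stdint.h>
    #include <stdio.h>

    /* Replace this stub with the real code under test (name hypothetical). */
    static int parse_heartbeat(const uint8_t *data, size_t len) {
        (void)data; (void)len; return 0;
    }

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) return 1;
        static uint8_t buf[65536];
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);
        parse_heartbeat(buf, n);  /* with ASan, silent over-reads become aborts */
        return 0;
    }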

It is possible that a mutation-based fuzzer could have found Heartbleed once coupled with better fault detection, but a mutation-based fuzzer would probably only find Heartbleed if the starting test cases included a heartbeat message. A generation-based fuzzer, once coupled with better fault detection, would be highly likely to find Heartbleed... but only if (1) it included rules to generate a heartbeat, and (2) it fuzzed lengths as well. For example, the Sulley fuzzing framework automatically computes block lengths, but by default it does not fuzz the lengths to make them inconsistent with the data being sent. If you use Sulley, you’ll probably need to set the “block sizers” to be “fuzzable=True”; this creates a more rigorous test (as is probably needed to detect Heartbleed) but it is not the default. Thus, Heartbleed shows that we need to fuzz lengths, not assume that lengths are always being checked properly.
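To illustrate what “fuzzing lengths” means here: a TLS heartbeat request carries its own payload-length field, and a length-fuzzing test deliberately makes that field disagree with the bytes actually sent. Here is a simplified sketch of building such a message (the field layout follows RFC 6520; the function name is mine):

    #include <stdint.h>
    #include <string.h>

    /* Build a TLS heartbeat request whose declared payload length disagrees
     * with the payload actually sent - the Heartbleed trigger. Layout per
     * RFC 6520: type(1) | payload_length(2, big-endian) | payload | padding.
     * 'out' must have room for 20 bytes. */
    size_t build_length_fuzzed_heartbeat(uint8_t *out, uint16_t claimed_len) {
        size_t i = 0;
        out[i++] = 0x01;                         /* heartbeat_request */
        out[i++] = (uint8_t)(claimed_len >> 8);  /* claimed payload length */
        out[i++] = (uint8_t)(claimed_len & 0xff);
        out[i++] = 0x42;                         /* but only 1 byte of payload */
        memset(out + i, 0, 16);                  /* minimum 16 bytes of padding */
        i += 16;
        return i;  /* a compliant receiver must discard if claimed > actual */
    }

A generation-based fuzzer that varies claimed_len independently of the payload actually sent (say, claiming 0x4000 bytes while sending one) would elicit exactly the oversized response that reveals Heartbleed on a vulnerable peer.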

Oh, I should add a quick terminology note. People sometimes use the term “negative testing” as if it were a synonym for fuzz testing [Takanen2008, page xix]. I do not use the term that way; instead, I use negative testing in a broader sense. Still, fuzz testing is a useful approach for negative testing, so much so that I have listed it as a separate category.

It would be possible to send inputs like a traditional fuzzer, but examine the outputs more thoroughly. This approach is discussed later in the section on fuzzing with output examination.

It’s debatable whether or not fuzzers are more expensive than negative testing, but here is my reasoning. One advantage of negative testing is that it is really easy to get started; presuming you already have a test suite, you can just start adding negative tests. More importantly, though, negative tests rapidly give an unambiguous answer as to what caused the problem, and since they require little computing power (compared to fuzz testing) developers can easily re-run a test suite on every patch. In contrast, fuzz testing often requires more computing power and interpretation of results; computing power is cheap, but this factor still slows down feedback to developers. The potentially-faster feedback of negative testing could lead to faster developer detection and fixes. Today a key cost driver is developer time, not computing time; a mechanism that best reduces developer time is really helpful and tends to be less costly. Also, you can make a negative test suite once for a given protocol; you can then easily reuse the test suite on every implementation and every patch of each implementation. Of course, these are not in conflict; it is better to do both negative testing and fuzz testing.

Compiling with address checking and standard memory allocator (hybrid analysis)

What if you want to use a program right now, in situations where it’s really important to counter attackers who may exploit unknown vulnerabilities? It turns out there is at least one approach that could have worked. In addition, it might have provided some early warning of exploitation (a rather late form of detection, but detection nonetheless).

One approach is to use a mechanism that detects (at run-time) attempts to read past the end of an allocated memory region. In this approach, you’re not just changing how tests are run; the idea is that you actually use this version during operation! This requires that the program allocate and deallocate memory normally. In particular, the program must not combine multiple allocations into one allocation request (as is done by a slab allocator or memory slicing implementation).

There are several mechanisms that could detect such things at run-time. These are basically a subset of the detection mechanisms for fuzzing with address checking and standard memory allocator (dynamic analysis), with the additional challenge that speed and memory use are much more important. Here are a few examples:

Address sanitizer (ASan). Note that you have to recompile the program to use it. As noted above, ASan is just a flag (-fsanitize=address) in the LLVM/clang and gcc compilers, so this is relatively easy to do for most C software. It has an average performance overhead of 73%, and uses 2x-4x the memory [Serebryany2012]. This is probably not something you’d want to do on a smartphone (few people will want their battery life halved), and many busy websites will not welcome the overhead either. But modern computers have far more performance and memory than in the past, so in some situations this is acceptable... and this is something you can do immediately to counter unknown attacks. ASan is especially powerful because it detects a long list of potential problems, including most invalid buffer accesses (not just this particular kind). ASan is not available in all compilers; it would be a good idea for other C, C++, and Objective-C compilers to add it.

Intel Memory Protection Extensions (Intel MPX). As of 2014 CPUs with MPX are not available to the public, and MPX will not help non-Intel CPU architectures.

Memory allocation guard pages. Some debugging systems and system memory allocators make it possible to add unmapped “guard pages” after allocated memory that prevent both reading and writing past the end (a minimal sketch follows this list). These would have inhibited or stopped Heartbleed, depending on how the mechanism is implemented. For example, OpenBSD’s malloc implementation supports guard pages. I think GNU libc and similar runtimes should add something like the OpenBSD malloc guard page mechanism so that over-reads can be countered.
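To make the guard-page idea concrete, here is a minimal POSIX-style sketch (function name is mine; error handling and alignment are simplified, and exact feature-test macros vary by platform):

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on some systems */
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Allocate n bytes so the object ends just before an unmapped guard page;
     * any read or write past the end then faults (SIGSEGV) instead of silently
     * leaking adjacent memory. For simplicity this sketch ignores alignment
     * requirements and omits the matching free (munmap the whole mapping). */
    void *guarded_alloc(size_t n) {
        size_t page  = (size_t)sysconf(_SC_PAGESIZE);
        size_t total = ((n + page - 1) / page + 1) * page;  /* data + guard */
        unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return NULL;
        if (mprotect(base + total - page, page, PROT_NONE) != 0)
            return NULL;                       /* could not create the guard */
        return base + (total - page - n);      /* object ends at the guard */
    }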

I suspect ASan would be faster than adding a guard page on every allocation, but adding guard pages does not usually require a recompile, so there is an advantage to using guard pages instead.

This is really a damage reduction approach, instead of an approach that eliminates the problem. From a security point of view this approach turns a loss of confidentiality into a loss of availability. In many cases, however, this is a good trade-off. Also, this approach makes the problem visible once the system is under attack; once a problem is visible it is usually easy to correct.

This approach can easily be combined with a honeypot or honeynet (my thanks to Vincent Legoll, who pointed this out to me on 2014-05-05). Set up these hardened implementations on honeypot/honeynet systems (systems that should not be used by non-attackers), basically to detect and trap attackers. If an attacker tries to break the software, the software would crash instead, and that crash could be logged and tracked as especially important. Forensics could then detect some specific zero-day exploitations. I think this could also be done by some logging systems combined with intrusion detection systems; again, if a crash occurs in a hardened crypto library, log it specially. This would make it much easier to detect widespread exploitation of a 0-day attack. Distributions, core infrastructure organizations, and other organizations could establish these across the Internet and help protect us all. This would be a relatively late form of detection, but in some cases it would detect attacks before others were attacked.

While this approach doesn’t fully fix the problem, it does provide a powerful mitigation, and can be used as part of a larger detection approach. Some distributions or organizations might want to use these countermeasures in specific situations, or at least make these countermeasures easier to enable.

Changing the code doesn’t cost much effort, and recompiling is usually fairly simple as well (when you have the source code). However, the performance loss would be significant in many settings; it’s like losing part of the hardware performance you paid for. For example, using ASan you lose around half your speed. Thus, I’m counting this approach as a more expensive solution, to capture this loss of hardware performance. In many situations the operational impact would be significant; on smartphones this would reduce speed and battery life, and on popular servers this could slow responses and increase electrical power costs. If future CPUs add hardware support for ASan, the speed impact could be reduced significantly (the ASan paper estimates that the speed overhead would drop from 73% to about 20%). I would love to see CPU manufacturers explore this.

Focused manual spotcheck requiring validation of every field (static analysis)

The vulnerable code was reviewed by a human, so merely having a single human reviewer was obviously not enough.

However, a variation would likely have worked: requiring the human (manual) review to specifically confirm that every untrusted data field is validated. Checklists sometimes get a bad name in computer security. I suspect one reason is that checklists are sometimes deployed to people who don’t know what they’re doing, who then can’t use them effectively. But expert airplane pilots routinely use checklists, even though they do know what they are doing. If patches are only accepted after they are reviewed using a checklist, and the checklist includes “must show that every untrusted data field is validated”, then it is likely that this vulnerability would have been countered.

I had originally included this approach as part of the approach thorough human review / audit, but this is a different and much lower-cost approach. However, it does require that the reviewer(s) apply it to every patch as they come in; it cannot easily help with a large body of pre-existing code.

Fuzzing with output examination (dynamic analysis)

Fuzz testing traditionally involves sending lots of input to a program and looking for grossly incorrect behavior such as crashes. In fuzzing with output examination, the fuzzing system also examines the TOE output, e.g., to determine if the output is expected, or if it has various anomalies that suggest vulnerabilities. To accomplish this, the fuzzing system is provided additional information about the expected response (e.g., as required by a specification) or lack thereof, e.g., some constraints on the expected TOE output. The TOE response can also be compared to patterns (typically based on heuristics) that suggest vulnerable behavior (such as evidence of a cross-site scripting vulnerability).

This is possible to do with generation-based fuzzers because these kinds of fuzzers are already provided information about the correct sequence of interface input (e.g., of a protocol). This approach simply extends this information to also describe the expected output. The description of expected output need not be exact. It becomes increasingly likely to find existing vulnerabilities by making the output description increasingly precise, but of course, more exacting descriptions require much more effort to create. In some sense this approach stretches fuzz testing back towards traditional thorough negative testing in test cases. Early fuzz testing gave up the idea of knowing exactly what the expected output is (to simplify creating test cases), while this approach re-introduces the idea of examining results more carefully for correct behavior.

This is the approach that Codenomicon used to find Heartbleed. In their approach, they developed an additional mechanism called “Safeguard” inside their Defensics tool. Safeguard analyzes the TOE responses to determine if they matched what was expected. More information about this (at a very high level) can be found in [Codenomicon-How], [Eadicicco], and [Chandrashekar]. Codenomicon originally added this to just one protocol suite, SSL/TLS, but based on their success with this approach they are adding this approach to several other interfaces. I understand they intend to add Safeguard to five more interfaces by the end of June 2014, which clearly indicates that Codenomicon thinks this approach has value.

Codenomicon (particularly Mikko Varpiola) provided more information about Safeguard and SSL/TLS in particular. Safeguard was inspired by examination of an earlier vulnerability, CVE-2012-2388. This vulnerability involved signature handling which was tied to user authentication. A Codenomicon engineer realized that this kind of vulnerability could be detected by a fuzzer if it could detect that certain stages of a protocol could be incorrectly skipped. They then began to develop additional checks to examine the TOE output more rigorously.

Safeguard (at least for SSL/TLS) implements four kinds of checks:

Authentication bypass. This checks whether a user with insufficient credentials is granted access to resources they’re not supposed to access. This includes guessing the “right” commands, skipping an authentication phase, or allowing unauthorized access to system / protected resources. This is the check that started the whole approach.

Weak encryption warning. This checks whether a known weak cryptographic algorithm is accepted. This is especially useful for detecting whether a client or server can be forced or downgraded to a weaker cryptographic algorithm.

Amplification. This warns if a small amount of data sent to the TOE results in a very large response from the TOE. This is an especially important issue for higher-level protocols (such as DNS) that can be built on connectionless protocols like UDP. US-CERT Alert (TA14-017A), UDP-based Amplification Attacks, notes that “certain UDP protocols have been found to have [responses] much larger than the initial request... [so] a single packet can generate tens or hundreds of times the bandwidth... This is called an amplification attack, and when combined with a reflective DoS attack on a large scale it makes it relatively easy to conduct DDoS attacks.” In the case of Safeguard, at the beginning of a test run they calculate a baseline amplification table (BAT) of requests sent vs. responses received, based on known valid protocol interactions. They then send fuzzed packets and calculate the bandwidth amplification factor (BAF) for each interaction resulting from a fuzzed message. This allows them to pinpoint issues leading to attacks, such as the recent NTP reflection attacks where small UDP messages from spoofed addresses were used to generate large responses from NTP servers. Many UDP-based protocols are inherently vulnerable to amplification, but some protocols and implementations are more vulnerable than others.

Data leakage. If the amplification check’s BAF reaches a certain level, this checks whether “something unusual came back”, using heuristics such as entropy calculations and string matching. The goal is to determine whether the program is returning memory contents, has triggered a SQL injection, or in some other way produces output that suggests a vulnerability. Heartbleed was caught by this data leakage check in Safeguard: it triggered a warning for a rather large BAF, followed by an alarm because of what was detected in the returned data.

Mikko Varpiola told me that these data leakage checks seem to be surprisingly useful when fuzzing protocols that carry usernames; usernames are often processed by a SQL database, and this kind of fuzzing can help detect SQL injection vulnerabilities.
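Reconstructing the idea (this is my own sketch of such heuristics, not Codenomicon’s code, and the thresholds are invented purely for illustration), the amplification and data-leakage checks can be thought of as comparisons like these:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Bandwidth amplification factor: response size relative to request size. */
    static double baf(size_t request_len, size_t response_len) {
        return request_len ? (double)response_len / (double)request_len : 0.0;
    }

    /* Shannon entropy in bits per byte: leaked memory or ciphertext tends to
     * score high, while fixed protocol responses score much lower. */
    static double entropy(const uint8_t *buf, size_t len) {
        size_t counts[256] = {0};
        double h = 0.0;
        if (len == 0) return 0.0;
        for (size_t i = 0; i < len; i++) counts[buf[i]]++;
        for (int b = 0; b < 256; b++) {
            if (counts[b] == 0) continue;
            double p = (double)counts[b] / (double)len;
            h -= p * log2(p);
        }
        return h;
    }

    /* Raise an alarm when a fuzzed message yields a far larger response than
     * the baseline for valid traffic AND the contents look like raw memory. */
    static int looks_like_leak(size_t req_len, const uint8_t *resp,
                               size_t resp_len, double baseline_baf) {
        return baf(req_len, resp_len) > 10.0 * baseline_baf
            && entropy(resp, resp_len) > 6.0;
    }

A Heartbleed-vulnerable server answers a tiny fuzzed request with tens of kilobytes of leaked memory contents, so checks along these lines would typically fire on both counts at once.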

My thanks to the people from Codenomicon who provided information to me on how Safeguard works in Defensics: Steve Hayes, Josh Morin, Bob Sturm, and Mikko Varpiola. Mikko Varpiola, in particular, provided a great deal of detailed information on Safeguard. Any mistakes in my paraphrasing of their information are my own.

One advantage of this approach is that you only need to observe the output of the system. You do not need source code or the ability to manipulate the underlying platform, so this can be used to examine systems like routers as black boxes. That is really impressive, and is a significant contrast to fuzzing with address checking (which typically requires source code or at least the ability to manipulate the underlying platform).

A challenge with fuzzing with output examination is that someone must create this additional information about the expected output. Obviously it takes additional time to encode the information about the expected output (since this is in addition to the interface information already necessary to generate input). Another problem is that determining this information is not easy. Specifications (such as IETF RFCs) are notorious for under-specifying what should happen with incorrect or barely-correct input. It is possible to start with a more rigorous requirement and then add various exceptions or allow more variations, but this can take many iterations involving detailed examinations of TOE output. It is also possible to specify the results weakly, but the more generous the specification, the less likely it is to find vulnerabilities.

I have identified this as somewhat more costly because it requires significant interface-specific analysis to determine what the output requirements should be, and then encode them. However, once this information is encoded it can be reused to test later versions or alternative implementations of the same interface.

Context-configured source code weakness analyzers, including annotation systems (static analysis)

Traditional source code weakness analyzers could not find Heartbleed, because they used general-purpose heuristics that simply didn’t work well enough in this case, in part because of the complexity of the code. It is always best to simplify the code where you can, but there is always some minimal complexity based on what you are trying to accomplish, and real humans are unlikely to achieve perfect simplicity anyway. Coverity is developing some new heuristics that they think would detect Heartbleed [Chou2014]... and good for them! At least one person has implemented similar heuristics using clang [Ruef2014]. Indeed, I expect all source code weakness analyzers to improve over time, and thus find vulnerabilities that they didn’t find before. But generic heuristics can only go so far at any point in time; can you go beyond them?

The answer is yes, and I call this a context-configured source code weakness analyzer. The basic idea is that you start with a source code weakness analyzer, but you then provide far more information about the program that you are analyzing.

This approach requires much more time than just running a source code weakness analyzer, and the additional information is typically tied to one specific tool (locking you into that tool). However, if you provide more information about your program, the source code weakness analyzer can do a much better job.

Klocwork has shown that this approach definitely works for Heartbleed [Sarkar2014].

Now let’s talk about annotation systems. There are various ways to provide this additional information to static analysis tools; an annotation system adds this additional information as part of the program itself. One common way is to add an annotation system to the programming language, and then modify the program to use these annotations. These annotations can be added by directly changing the code (using new keywords), added as comments, or added in separate files. Examples of tools or annotation languages for C include Microsoft’s SAL, splint, Deputy, Oink/CQual++, cqual, and Frama-C ANSI/ISO C Specification Language (ACSL). Static analysis tools can check the information from the annotation system on every compilation, providing quick feedback once they are used, and they are not limited to specific input values (i.e., they are not limited by the problems of dynamic analysis). You could easily argue that adding this information (via annotation systems) is really a different technique.
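For instance, in Frama-C’s ACSL one can state, as a machine-checkable precondition, exactly the fact whose absence caused Heartbleed: that a claimed length must not exceed the buffer actually received. A sketch (the function and parameter names are mine, not from OpenSSL):

    #include <stddef.h>
    #include <string.h>

    /*@ requires \valid_read(record + (0 .. record_len - 1));
        requires \valid(out + (0 .. payload_len - 1));
        requires payload_len <= record_len;
        assigns out[0 .. payload_len - 1];
    */
    void copy_payload(unsigned char *out, const unsigned char *record,
                      size_t record_len, size_t payload_len)
    {
        memcpy(out, record, payload_len);
    }

The third requires clause is precisely the check OpenSSL omitted; given such annotations, a tool like Frama-C can attempt to prove (or report on) whether every caller satisfies it.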

Seriously using these additional annotations to counter vulnerabilities often requires a non-trivial amount of work if you are starting with existing code. There are also many different incompatible annotation systems for C, and there are no standards for them, which further impedes their use. After all, it takes work to add annotations, and those annotations lock you into a specific tool. Microsoft SAL has additional problems: there is no FLOSS implementation, and it is only available on Windows. I think annotation systems would be much more widely used if there were a single widely-accepted standard annotation notation for each major programming language, including C. It would be hard to get that kind of agreement for languages like C when there isn’t already such a notation. Peter Gutmann has written a post on some of his experiences [Gutmann].

However, annotation systems have many advantages. Annotation systems can find vulnerabilities that simply are not countered by switching to a different language. Also, they are often cheaper than switching to a different language (because you are simply adding additional information to an existing program). Of course, these are not in conflict; you can switch languages and use a code annotation system for the new language.

Multi-implementation 100% branch coverage (hybrid analysis)

Another approach that would probably have detected Heartbleed is 100% branch coverage of alternative implementations. As noted earlier, branch testing cannot detect when input validation code is missing in a particular program. Branch coverage can, however, detect existing untested branches in a different implementation. Striving for a test suite that gives full branch coverage of multiple implementations greatly increases the likelihood that missing validation code and missing exception handling would be detected. Stronger test coverage measures, such as modified condition/decision coverage (MC/DC), would work as well.

Like all coverage approaches, this is fundamentally a hybrid analysis technique. This uses dynamic analysis to run tests... and static analysis to determine which branches (or related coverage measures) have been left untested.

This approach is somewhat specialized for finding vulnerabilities. The test suite must be applied across multiple implementations, all with 100% branch coverage, so it requires that multiple implementations exist at all. What’s more, the more the implementations differ, the better. Also, this approach is probably less capable (by itself) of finding security vulnerabilities than other approaches. That’s because many different inputs may follow the same path, yet only a small subset of them might trigger a vulnerability. It also only works if one of the other implementations implements the particular component under test (in SSL/TLS, support for the heartbeat extension is optional) and implements the potentially-missing input validation code.

I have never seen this specific approach discussed in the literature; usually people discuss branch coverage of a single implementation (instead of multiple implementations). Still, it is fair to note that this approach can not only help improve quality, but it could also have found this particular vulnerability.

One caution about the static-analysis approaches discussed earlier (context-configured analyzers and annotation systems): they would not necessarily counter Heartbleed, because much depends on the configuration extensions or annotations used and how they are used. In particular, the output would need to be checked thoroughly enough to detect that a problem occurred. On the other hand, they do not depend on hitting exactly the right input; static analyzers can examine a large number of situations simultaneously.

Multi-implementation 100% branch coverage is more costly than thorough negative testing, primarily because if you have a poor test suite it can take a lot of time to work backwards from a missed branch to figure out how to trigger it. Also, missed branches are often specialized error-handling systems that can be difficult to trigger, or undocumented “can’t happen” branches used as part of defensive design. In addition, the test suite has to grow enough to cover multiple implementations at 100%; many organizations do not even try to grow a test suite to do 100% branch coverage of a single implementation, never mind 100% coverage of multiple implementations.

Aggressive run-time assertions (dynamic analysis)

Software developers could aggressively insert and enable run-time assertions. There is speculation that this might have countered Heartbleed, so I will discuss this possibility here.

A software developer can assert that various value relationships or states must be true. These assertions can then be checked at run-time, at least while testing the software. Nearly all languages have a built-in assert mechanism (or equivalent) that can cause an exception or crash if a condition is not true at run-time at a specific program location. Several languages have more advanced built-in assertion mechanisms for specifying preconditions, postconditions, and invariants that can be checked at run-time (examples include Eiffel’s design-by-contract mechanisms and the Ada 2012 contracts). In some cases the language can optimize some of these assertions away, leaving the assertions it cannot optimize away at run-time. Indeed, an annotation system may be partly implemented statically, and partly implemented dynamically; see my previous comments about annotation systems for their static application.
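A sketch of what “aggressive” means here, in plain C assert() form (the function and parameter names are hypothetical): assert not only hard invariants but every believed relationship between lengths and buffers.

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical record handler illustrating aggressive preconditions.
     * Each assert() encodes a relationship the developer believes must hold;
     * a fuzzer or test suite that violates one turns a silent over-read into
     * an immediate, diagnosable failure. Build without -DNDEBUG so the
     * assertions stay enabled during testing. */
    void handle_record(unsigned char *out, size_t out_len,
                       const unsigned char *rec, size_t rec_len,
                       size_t claimed_payload_len) {
        assert(out != NULL && rec != NULL);
        assert(rec_len >= 3 + 16);                        /* header + padding */
        assert(claimed_payload_len <= rec_len - 3 - 16);  /* length vs. reality */
        assert(claimed_payload_len <= out_len);           /* response fits */
        memcpy(out, rec + 3, claimed_payload_len);        /* skip 3-byte header */
    }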

Temporally Enhanced System Logic Assertions (TESLA) is an even more advanced research approach that allows temporal assertions. You can find further information at http://www.cl.cam.ac.uk/research/security/ctsrd/tesla/. The Frama-C E-ACSL annotation language is a subset of ACSL; Frama-C can take E-ACSL annotations and cause run-time failures if the annotations are violated. E-ACSL support is in a preliminary state in Frama-C as of May 2014.

There is no doubt that assertions can be an excellent mechanism for detecting invalid states, and invalid states can sometimes be an indicator of a vulnerability.

However, this approach does have some weaknesses when it comes to countering Heartbleed. Neither the original developer nor the reviewer realized that checking the request packet length value was important; since the length check was not included, it is unlikely that the developer would have remembered to add assertions to check for it. This is also a problem for thorough negative testing, but negative testing is easily done by a group separate from those developing functional code, and it is much easier to ensure that (for example) all data fields are checked, so I think negative testing would be more likely to find this specific type of vulnerability. Thus, while aggressive run-time assertions can be a very useful approach for countering vulnerabilities, it is somewhat speculative whether they would have worked in this particular case.

Note that aggressive run-time assertions work very well with the fuzz testing approach described earlier. Run-time assertions detect very specific problems in program state, and thus create more situations that a fuzzer can detect. Jesse Ruderman expressed the complementary relationship of fuzzers and assertions in a wonderfully pithy way:

Fuzzers make things go wrong.

Assertions make sure we find out.

I have placed this approach as a somewhat more expensive option. For this approach to have detected Heartbleed (without knowing about it ahead of time) would have required very aggressive use of assertions. Adding all those assertions would take significant development time and would typically also impose a significant run-time cost.

Safer language (static analysis)

The underlying cause of Heartbleed is that the C programming language (used by OpenSSL) does not include any built-in detection or countermeasure for improper restriction of buffers (including buffer over-writes and over-reads). Improper restriction can often lead to catastrophic failures, so almost all other programming languages automatically counter improper restriction (e.g., by resizing data structures or by raising an exception when the buffer is exceeded).
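In simplified form (a sketch, not OpenSSL’s actual code), the vulnerable pattern and its fix look like this; in C the unchecked version compiles and runs silently, while nearly any other language would have stopped the out-of-bounds read at run time:

    #include <stddef.h>
    #include <string.h>

    /* Simplified Heartbleed pattern: 'claimed_len' is an attacker-supplied
     * length taken from the message itself; 'rec_len' is how many bytes
     * actually arrived. C performs no bounds checking, so the memcpy reads
     * past the end of 'rec' and leaks whatever memory sits next to it. */
    void respond_unchecked(unsigned char *resp, const unsigned char *rec,
                           size_t rec_len, size_t claimed_len) {
        (void)rec_len;               /* never consulted - that is the bug */
        memcpy(resp, rec, claimed_len);
    }

    /* The fix: refuse claimed lengths that exceed the data actually received
     * (RFC 6520 requires such a message to be silently discarded). */
    int respond_checked(unsigned char *resp, const unsigned char *rec,
                        size_t rec_len, size_t claimed_len) {
        if (claimed_len > rec_len)
            return -1;               /* discard the malformed request */
        memcpy(resp, rec, claimed_len);
        return 0;
    }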

If vulnerabilities in a given program can have catastrophic effects, then those choosing its programming language(s) should prefer the options that reduce the likelihood of vulnerabilities. The more catastrophic the effects, the stronger this preference should be. Most programming languages provide at least some direct protections against otherwise-dangerous vulnerabilities, such as improper restriction protection. Some programming languages also provide constructs that are less likely to be misused or are less likely to be incorrectly used. Ideally, a language would prevent all vulnerabilities. It is highly unlikely that a general-purpose language could ever prevent all vulnerabilities, but it is a worthwhile goal for language designers to strive for. There is no “perfectly safe” programming language; instead, there is a continuum, with some languages providing more vulnerability countermeasures than others.

Dangerous languages and why people use them

The most dangerous widely-used languages for security-relevant software are C, C++, and Objective-C. All of these languages provide no built-in restrictions on buffer access; indeed, it takes non-trivial effort to avoid problems like buffer over-reads and over-writes. Improper restriction of buffer access continues to be a widely-exploited type of vulnerability that often has catastrophic effects. Using or switching to almost any other language (other than C, C++, or Objective-C) would completely eliminate buffer-related vulnerabilities, including Heartbleed. This is especially true for C, because it lacks many of the higher-level constructs that make it somewhat easier to avoid buffer-handling problems. Most other languages also prevent memory deallocation errors that could lead to vulnerabilities (e.g., via automatic garbage collection), and some languages are designed to counter additional vulnerabilities as well. One of the reasons there are so many vulnerabilities in modern systems is the overuse of the C, C++, and Objective-C languages. In fact, some people have proposed banning the use of these languages in security-sensitive code.

However, there are reasons that C, C++, and Objective-C are widely used. The TIOBE Programming Community index measures programming language popularity, and as of April 2014 these languages occupy three of the four top slots (the top 4 languages in order are C, Java, Objective-C, and C++). These reasons include higher performance (in speed and memory use), ease of interface, large libraries, platform preference, and familiarity. Also, switching languages for large existing programs (like OpenSSL) is usually a big effort. Let’s examine a few of these reasons.

Speed and memory performance of alternatives

One oft-cited issue is that programs in C, C++, and Objective-C tend to have noticeably higher speed than programs written in most other languages. In addition, most other languages lack the lower-level mechanisms that are needed if you must directly interact with hardware (and this direct interaction can also increase speed). If you need the speed, and perhaps the direct interaction, the list of likely languages gets much shorter. And speed sometimes matters in a world of mobile devices (which have limited resources) and massive server farms (where poor performance further heats their environment). The benchmarks game includes some speed analysis of various programs written in various languages [BenchmarksGame]. The posting “Approximate speed classes of programming languages” took that data and grouped languages into different tiers based on their approximate speed [Jplus2014]. No benchmark is perfect, and it is always best to measure performance for a specific situation. Still, I prefer measured numbers to big guesses, and this dataset is representative enough to get started. If performance (measured as speed) is your most important criterion, and you do not want to write in assembly language, the other options according to that analysis are:

Fortran. Fortran implementations often beat everyone else in performance, especially in its primary niche of numerical calculation. There are many (mostly older) libraries maintained in Fortran. However, it is not clear to me that many people would be willing to switch a lot of code to this granddaddy of all other programming languages. In particular, to my knowledge even modern Fortrans do not have standard mechanisms to interface to lower-level hardware interfaces (for programs that need that).

Ada. Ada was designed to be used for hard real-time system implementation, so it is not surprising that it has good performance and includes mechanisms to access low-level components. Ada is also specifically designed to counter errors, preferably at compile time. Even its syntax is specifically designed to counter errors, and Ada certainly counters buffer over-reads like Heartbleed. Many people do not like Ada, often because Ada’s very strict static type-checking makes it harder to get a program accepted by an Ada compiler. However, this strict type-checking is one of its key mechanisms for compile-time detection of defects; Ada can detect at compile time or run time many problems that other programming languages do not detect. Ada is widely used in some high-assurance areas like air traffic control, railroad control, and the like. Nearly all of the Boeing 777 code is in Ada. The CubeSat program recently created an unintentional real-world test of Ada combined with SPARK. Twelve university CubeSats were launched; only one of those twelve was working properly as of June 2014, the one from Vermont Tech, and it was the only one that used SPARK Ada for its code. Project leader Dr. Carl Brandon says, “The use of SPARK/Ada helped make our software much more reliable than the others.” [AdaCore 2014] No formal specification method nor design methodology was used in the development of the CubeSat software (although SPARK supports them), but the developers did use the SPARK/Ada information flow analysis and proofs of freedom from runtime error [Brandon]. Program correctness is not vitally important in many programs, and in any case Ada is not the solution to every problem. However, Ada is a useful language to consider when correctness is vitally important yet run-time performance cannot be sacrificed.

ATS. This is not a well-known or widely-used programming language, but it did remarkably well in this particular benchmark suite. I should note that ATS is no longer listed in more recent performance charts (I do not know why).

There are many other programming languages, especially if you’re willing to give up a little speed as determined by that benchmark. For example, Go (developed by Google) has good performance. (There’s even been some work on converting C to Go automatically, though currently that work is only focused on translating the compiler, not C in general.) Rust is another programming language you can consider. Java has reasonable performance on modern JITs, once it gets going, but there is a non-trivial startup time. Other languages that look promising by these benchmark metrics include Scala, Free Pascal, Lisp SBCL (Steel Bank Common Lisp), Haskell, C# on Mono, F# on Mono, and OCaml (depending on how you cut off the next tier). Neither the D programming language nor the Nimrod programming language are listed in that benchmark, but they are also designed for efficiency.

Of course, if speed is not critical, there are a huge number of languages available. At least one study suggests that there is no statistical difference in the number of vulnerabilities in programs written with .NET, Java, ASP, PHP, Cold Fusion, and Perl. I often use Python when speed is not important because it has a clean and easily-understood syntax. Other languages, such as Ruby and Clojure, have many fans. Scheme is powerful (and I think the readable Lisp extensions solve the readability problems often noted about Lisp-based languages like Scheme). All of these other languages are safer than C, C++, or Objective-C, in the sense that all of them protect against buffer over-reads by default.

There are just too many programming languages to list, so I’ll stop here. My goal is not to list all alternatives; my goal is to make it clear that there are alternatives.

Performance is not just about speed; memory management approaches can also be important. This is especially true on mobile devices like smartphones. C, C++, and Objective-C do not provide automated garbage collectors; many other languages include them. Developers are generally more productive (in terms of functionality over time) if they do not have to think about memory management, but in some environments that is unrealistic. Drew Crawford has a lengthy discussion about mobile device development, where he states that “automated garbage collectors work well if you have at least six times as much memory as needed, but efficiency can [be] greatly harmed if there is less than four times as much memory. iOS has formed a culture around doing most things manually and trying to make the compiler do some of the easy parts. Android has formed a culture around improving a garbage collector that they try very hard not to use in practice. But either way, everybody spends a lot of time thinking about memory management when they write mobile applications. There’s just no substitute for thinking about memory” [Crawford2013]. Automated garbage collection is deprecated in OS X Mountain Lion v10.8, and will be removed in a future version of OS X; Automatic Reference Counting (ARC) is the recommended approach instead for OS X and iOS [Apple2013]. Again, there are reasons people choose C, C++, and Objective-C.

So why are programs in C, C++, and Objective-C often higher-performance (in speed and memory management) than many alternatives? The answer is, in part, because the languages are designed to be that way. In particular, C is designed to make it possible to write programs that run quickly and work well with limited resources (e.g., little memory). The C rationale states that a key principle in C is “trust the programmer” and that “many operations are defined to be how the target machine’s hardware does it” (which impedes portability but helps performance). Also, C’s performance cost model is transparent, so a C or C++ developer can usually estimate the performance aspects of a construct before using it. Different programming languages provide different levels of abstraction, and languages with higher-level abstractions can sometimes make detailed control more difficult. Indeed, many developers have difficulties estimating the performance aspects of programs written in languages that are significantly higher-level than C. Humans do not always estimate correctly, of course, but it is often hard to achieve good performance if it is hard to estimate performance. Performance transparency is especially important in cryptography, because developers need to counter timing attacks and electrical power attacks (my thanks to Markus Armbruster for pointing out this link between cryptography and performance transparency). So merely obtaining high performance is not enough; it is sometimes necessary to ensure that timing or power variances are small, yet few tools provide these measures. Cryptographic libraries can be written in other languages, of course, but other complications can arise depending on what language is used. Of course, implementations greatly vary in their performance; heavily-optimized compilers and run-times can achieve great performance compared to compilers and run-times that are not as heavily optimized.

Interfacing with other languages

Many developers choose C, C++, or Objective-C to simplify interfacing with other components. Many useful utilities have C interfaces, and most language infrastructures can call libraries written in C. However, many programming language systems have easy ways to both call C routines and to be called by other systems using C interfaces. Thus, this isn’t as important a reason today to choose these languages.

Reducing language risks

Developers using C, C++, and Objective-C can reduce their risks in various ways, such as using less-risky library functions and using language subsets. These merely reduce the risk somewhat, not eliminate it; it is really easy to make a mistake even when using these facilities. Still, risk reduction can be valuable.

Creating library functions that reduce the likelihood of security vulnerabilities, and then using them, can help reduce the risk of vulnerabilities. This is especially true if these functions are part of the standard library for these languages, since these will tend to be widely-understood, portable, and well-supported. Here are a few examples for various languages:

C. There are many options for C, but all current options have problems; in many cases the better options are not widely available. “Secure Coding in C and C++: Strings and Buffer Overflows” by Robert C. Seacord (Apr 24, 2013) (an extract from his book “Secure Coding in C and C++” Second Edition) briefly explains various options for safely handling string buffers in C and C++. The latest C standard (C11) has added many functions to the built-in C library, pa