In this note we are taking the opportunity to write a retrospective on our paper Compositional Shape Analysis by means of Bi-abduction from the 2009 POPL conference, which received the 2019 POPL MIP award. The award citation summarizes the actual work admirably; part of it reads as follows:

Bi-abduction allows an analysis to determine separation logic pre-conditions under which the procedure executes safely, and corresponding post-conditions that describe how the procedure mutates the heap when executed from a configuration satisfying the precondition. We can then use separation logic to compose the pre- and post-conditions for different functions, yielding the first precise shape analyses capable of running on millions of lines of code.

In this retrospective we won’t delve into the technical side of the work — for that, you can read the paper — but we do want to provide some information on the context that led to the paper, and what we have learned since then.

TL;DR: (i) asking daring questions far beyond current capabilities is rewarding; (ii) compositionality confers benefits as regards incrementality and scale that unlock impact; (iii) now is a great time for working in or with PL theory, with potential for outside influence as well as many unexplored problems where research is needed.

Dare to Dream (Go Huge or Go Home)

Before POPL’09 all of the authors had backgrounds in PL theory, but were gradually moving towards more practical work. This started with Smallfoot, the first verification tool based on Separation Logic, and then moved on to Space Invader, an automatic program analyzer. Eventually we got to the point where Invader achieved some of the strongest practical research results at the time: it was able to verify memory safety properties of device drivers, up to about 10k LOC, and found some bugs in a Microsoft driver that were confirmed and fixed by the Windows kernel team.

From the scientific point of view this seemed like successful work. For context, Microsoft’s SLAM tool was creating waves because of its application of verification technology to real device drivers. Invader could process similar input programs, but because SLAM assumed memory safety, Invader found bugs that SLAM had missed. The analysis technique (shape analysis) that we used to build an accurate picture of memory was considered extremely expensive, but somehow we made it to 10k LOC, where previously even 1k was uncommon.

Then one day, while the four of us were sitting in a cafe in the east end of London, Cristiano exclaimed: “These results are completely useless, 10k is too small.” Cristiano and Dino were dreaming of starting a program proof company, and Cristiano saw that scaling to large codebases would unlock major potential. In his office at Imperial College, Cristiano had a single message, in large letters on his blackboard: Go Huge or Go Home! Back in the cafe, he asked: “What is the main problem blocking application to one million lines of code?” Based on his schooling in denotational semantics, Peter answered: “The need for the human to supply preconditions; we need a compositional version of our techniques that discovers preconditions.” Furthermore, based on ideas in separation logic, the analysis might be local in the state and not only the code: it could aim for small specifications tracking only the resources accessed (the footprint).

The four of us then worked on the problem of finding a truly local analysis that was compositional in both code and state. After a number of false starts and blind alleys (e.g., a backwards analysis that turned out to be too slow to go huge), Hongseok and Dino made pivotal technical discoveries. Hongseok figured out how to make an intra-procedural analysis that discovers preconditions approximating the footprint, by mining information from failed proofs. Then Dino made a breakthrough, in discovering a logical concept (that we eventually called bi-abduction) to perform inter-procedural discovery of preconditions. After this, we all worked together developing these ideas into what would become the POPL’09 paper.

Compositionality and Bi-Abduction

The notion of compositionality comes originally from language semantics: a semantics is compositional if the meaning of a compound phrase is defined in terms of the meanings of its parts and a method of combining them. Typically, denotational semantics is compositional whereas operational semantics is not. We can transport this notion to program analysis as follows:

An analysis is compositional if the analysis result of a compound program is defined in terms of the analysis results of its parts and a means to combine them.

Our paper emphasized three benefits of compositional analyses, compared to global ones.

1. The ability to analyze program parts without having the surrounding context. This is an aid during program development.

2. Potential to scale. Essentially, modulo recursion, the running time of a compositional analysis can be a linear combination of the analysis times of its components, as long as the “means to combine” is not too expensive.

3. Graceful imprecision. If an analysis gets confused on one area of code, the resulting imprecision need not leak into analyses of code that follows it.

Compositional algorithms can also have a drawback compared to their global cousins: they can lose precision. For example, if an analysis examines all the call sites of a procedure when analyzing the procedure itself, additional information can be gleaned to provide a more precise global result (say, that a procedure is never called with a null argument).
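The definition of compositional analysis can be made concrete with a deliberately tiny example. The sketch below is entirely invented for illustration (the program representation and the "may-write variables" analysis are not from the paper): each procedure is analyzed once, producing a summary, and summaries are combined at call sites, so each part is analyzed independently of its callers.

```python
# A toy compositional analysis: each procedure is analyzed once, in
# isolation, producing a summary; summaries are combined at call sites.
# The analysis and program representation are invented for illustration.

# A procedure body is a list of statements: ("write", var) or ("call", proc).
PROGRAM = {
    "leaf":   [("write", "a")],
    "middle": [("write", "b"), ("call", "leaf")],
    "main":   [("call", "middle"), ("write", "c")],
}

def summarize(proc, cache):
    """Compute the set of variables a procedure may write, reusing
    previously computed summaries of its parts (modulo recursion)."""
    if proc in cache:                  # each part is analyzed only once
        return cache[proc]
    writes = set()
    for stmt in PROGRAM[proc]:
        if stmt[0] == "write":
            writes.add(stmt[1])
        else:                          # combine the callee's summary
            writes |= summarize(stmt[1], cache)
    cache[proc] = writes
    return writes

cache = {}
print(summarize("main", cache))   # {'a', 'b', 'c'}
```

Because the summary of "leaf" is cached independently of its callers, a change to "main" only requires re-analyzing "main" itself — the incrementality benefit described above.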

It’s all well and good to wish for a compositional analysis with wonderful properties, but how can one be obtained? Our approach was based on bi-abduction, a new kind of logical inference problem.

Usually, logic works with validity or entailment statements of the form

A ⊢ B

which says that A implies B. Bi-abduction is an extension of this question, of the form

A ✽ ?antiframe ⊢ B ✽ ?frame

Here, A and B are formulae that are given, and the problem is to discover a pair of frame and antiframe formulae that make the entailment statement valid. The ✽ in this question is the separating conjunction of separation logic. The antiframe is an abductive hypothesis: it represents missing resource which is in B but not A. In contrast, the frame refers to additional resource, in A but not B. For a specific example, if A is “pointer x is owned” ✽ “pointer y is owned” and B is “pointer y is owned” ✽ “pointer z is owned”, then the inference question is

“x is owned” ✽ “y is owned” ✽ ?antiframe ⊢ “y is owned” ✽ “z is owned” ✽ ?frame

A solution in this case for the antiframe is “pointer z is owned”, while a solution for the frame is “pointer x is owned”.
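In a deliberately simplified model where formulae are just sets of owned pointers and ✽ is disjoint union, bi-abduction reduces to two set differences. This is only a sketch of the idea under that assumption — the real problem ranges over full separation-logic formulae with heaps, list segments, and so on — but it captures the inverse relationship between the two inferred quantities:

```python
def biabduce(A, B):
    """Toy bi-abduction: formulae are modeled as sets of owned pointers
    and * as disjoint union.  Find ?antiframe and ?frame such that
    A * ?antiframe |- B * ?frame.  (A sketch only; the real problem is
    over separation-logic formulae, not flat sets.)"""
    antiframe = B - A   # resource needed by B but missing from A
    frame = A - B       # resource A owns that B does not mention
    return antiframe, frame

# The example from the text: A = {x, y}, B = {y, z}
antiframe, frame = biabduce({"x", "y"}, {"y", "z"})
print(antiframe)  # {'z'}
print(frame)      # {'x'}
```

In this toy model the antiframe and frame are exactly the two directions of set difference, which is one way to see why we chose the mirrored names.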

The terminology “frame” comes from the frame problem in artificial intelligence, which is to succinctly describe what changes without having to laboriously specify what doesn’t change. The Smallfoot and Space Invader tools both inferred the frame, which helped to keep specifications small. Abductor, the tool from the POPL’09 paper, went one step further to abduce the antiframe; we made up the term “antiframe” because our inference problem shows the inverse nature of the two inferred quantities.

The way these ideas are used in Abductor is as follows. As a program analysis symbolically executes going forward over program statements, a bi-abductive theorem prover is called at each step. When there is not enough resource to prove that a step is memory safe, information is synthesized in the antiframe and fed back to the overall precondition, while the frame is fed forward with the result B of symbolic execution. So, for example, given a procedure to free all the elements of a linked list, this method infers the pre-condition that the parameter points to an acyclic singly-linked list, and it does this without help from the human. The analysis even has a remarkable begin-anywhere aspect: we can start at any procedure in a bare codebase, with no human specifications, and the analyzer will synthesize approximations of the memory the procedure expects.
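The feedback loop described above can be sketched in a few lines. Everything here is invented for illustration (the statement forms, the flat "set of owned pointers" state): when a step needs a resource the symbolic state lacks, the missing resource is abduced and fed back into the overall precondition, and execution resumes.

```python
# A sketch of Abductor-style symbolic execution with abduction:
# when a step needs a resource the current state lacks, the missing
# resource is added both to the state and to the inferred precondition.
# Statement and state representations are invented for illustration.

def analyze(body):
    """Symbolically execute a list of ("use", ptr) / ("free", ptr)
    statements, abducing the footprint the procedure needs."""
    pre = set()      # inferred precondition: pointers assumed owned
    state = set()    # current symbolic state: pointers owned right now
    for op, ptr in body:
        if ptr not in state:      # proof failure: abduce the antiframe,
            pre.add(ptr)          # feed it back to the precondition,
            state.add(ptr)        # and resume execution with it
        if op == "free":
            state.remove(ptr)     # freeing consumes ownership
    return pre, state             # leftover state plays the role of the frame

# A procedure that dereferences p, then frees p and q
pre, post = analyze([("use", "p"), ("free", "p"), ("free", "q")])
print(pre)    # {'p', 'q'}: the procedure expects to own both pointers
print(post)   # set(): everything it owned was freed
```

Note the begin-anywhere flavor: the analysis starts with an empty precondition and no knowledge of callers, and still recovers an approximation of the memory the procedure expects.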

Inference of the antiframe, a quantity to the left of ⊢, is an instance of the notion of abductive inference. Abduction was advanced by the philosopher Charles Peirce in his work on explicating the scientific method, where he distinguished hypothesis discovery (abduction) from the deduction (confirming true conclusions) and induction (generalization) inference patterns. Abduction intuitively fits the problem of precondition discovery very well: the inferred preconditions are the hypotheses. What is more, abduction often seeks the “best” or at least a “small” solution, and this meshes with the idea to attempt to infer a description of the footprint rather than a description of the global state.

Questions arise about how to demonstrate or measure whether an analysis is “good”. In the POPL’09 paper we demonstrated scale by running on large codebases, including a fragment of Linux of over 3M LOC. Of course, scale is easily achievable with zero precision, so we also investigated some aspects of precision by looking at the preconditions obtained for small programs, and for one of the 10k LOC device drivers we had previously analyzed via global analysis. The analysis discovered remarkably good preconditions for many small programs, often similar to ones that humans would naturally write. For the device driver we discovered specs for 121/121 procedures, but for one of them the spec was overly constraining and ruled out numerous program paths: an example of precision loss compared to the global analysis.

We could not make a comparison of local and global on huge codebases (over 1M LOC), because the global shape analyses would never terminate on such large codebases (or rather, they exhausted RAM and we never observed them terminate). This made some people unhappy. It is important to measure precision loss on the large codebases, they said. However, this reaction got things the wrong way around. Suppose algorithm A (which, incidentally, does not itself have independent justification as regards precision) never gives you results because it’s too slow, while algorithm B terminates and gives you answers. You shouldn’t then complain that algorithm B doesn’t match A. Rather, you should ask why algorithm A should exist in the first place, because it has much less potential to help people.

Quite apart from debates about measures of analysis quality (more on this below), a more important activity was taking place in parallel: demos!

Seeing is Believing

Cristiano had the slogan “seeing is believing”. We could cut through many questions by demonstrations. Dino cooked up a demo where, when giving a talk, we would run Abductor on a largish codebase. Often we would choose OpenSSL, over 200k LOC, and the whole analysis would take over 20 minutes but be finished before the end of the talk. Then at the end of the talk we would show the results, where Abductor had found a number of potential memory safety errors. We would fix a memory leak, and run the tool again to confirm the fix. The second run used the begin-anywhere aspect discussed above, along with cached summaries of surrounding procedures, and would take a couple of seconds instead of 20 minutes (see here for a similar example with Infer). The advantage over global was so stark that no debate was necessary: begin-anywhere gives an incremental method, with evident potential to be used while programs are being developed. Even a linear-time, global analysis would (almost always) be too slow to keep up with frequent, small code changes.

This demo caused great excitement, more than the paper or the talks. People could see proof happening, live, and re-proofs tracking code changes; they could see potential.

Impact: Monoidics, Infer and Facebook

The startup proof company Monoidics developed Infer, an industrial version of the algorithms in the POPL’09 paper. Monoidics used the interactive demos not to illustrate ideas in scientific talks for our fellow researchers, but to help communicate with potential investors, customers, and buyers. Eventually, Monoidics was acquired by Facebook, and Cristiano, Dino and Peter joined Facebook in 2013 as part of the deal.

What we encountered upon joining went far beyond the scale of anything we had imagined. Facebook had tens of millions of lines of internal code, being updated thousands of times per week. Infer needed to run on these thousands of updates. To get an idea of the magnitude of the challenge: just a few years before we had celebrated reaching 1k and then 10k LOC; now we were faced with needing thousands of runs on codebases of tens of millions of lines in a short period of time.

Fortunately, the principle of compositionality is very powerful. When a programmer submits a code modification (a “diff”), an instance of Infer is spun up in Facebook’s data centers. Using begin-anywhere, Infer would start from the code change instead of from the “main” program. Using abduction, a description of just enough state to cover that component was inferred. Infer would call an abductive theorem prover at least once, and sometimes several times, for each of the modified lines in each of the thousands of code updates per week that it ran on. We weren’t aware of many previous deployments of formal reasoning operating in such a high-momentum situation: large (MLOC) codebases, rapidly changing (thousands of changes per week). And all this was in 2014; now Infer runs on thousands of diffs daily. Even though we never tested it at anything approaching this stress level in our academic work, the basic techniques from the POPL paper held up well when confronted with this scale.

The demo mentality helped Infer to succeed when we took it inside Facebook. Rather than show people a table of (say) runtimes and (putative) false positive rates, we would run the tool and show them specific reports. The engineers would then respond telling us what they did and didn’t like, and we would enter into an iterative process of improving the tool to give them what would help them; like a picture being worth a thousand words, a live, incremental demo can be more powerful than the rows in an inert table. If Infer hadn’t run as fast as it did, we wouldn’t even have been able to enter these conversations. Of course, while demos are cool it’s still very important to measure as well as to show, and there we ended up using a concept — fix rate, the proportion of reports fixed by engineers — that is easier to measure than precision or false positive rate. Fix rate has an advantage over raw analyzer precision in that it speaks to the “does it help people?” aspect of analysis quality; it’s based on how people react to analyzer reports. At the time of POPL’09 we did not yet appreciate this point — indeed, we didn’t have the tool users who could teach us about it — and the proxies of scale + speed + precision were still useful, even if they did not directly address the ultimate question of whether the analysis might help programmers.

Another key learning was that the value of analysis results decays with the time taken to deliver them. In one example, a fast, diff-time deployment saw a 70% fix rate, whereas the same analysis run globally, overnight, saw a 0% fix rate. Analyzing code modifications enables timely feedback that fits with developers’ workflows. These observations led us to build further analyses, not always based on bi-abduction, but retaining the begin-anywhere capability. As of 2019, ten years after the POPL paper, over 100k bugs reported by Infer had been fixed in Facebook’s codebases before code reached production. Infer was open-sourced in 2015, and it is used by other companies such as Amazon and Spotify. You can read more in a recent CACM paper on Scaling Static Analyses at Facebook, or download Infer for yourself by visiting fbinfer.com.

PL theory in the modern world

An important lesson for newcomers to the subject is the value in being outward looking. It’s easy to get wrapped up in research, and to avoid reading outside your area or talking to people beyond it, but though you do have to focus there is great value in breadth. In our case, we were working on a program analysis problem, but instead of studying the problem in a standard way — say by changing abstract domains to balance speed and precision — we brought ideas from language semantics (compositionality), from AI (abduction and the frame problem), as well as a desire to work at industrial scale, to bear on what we ended up doing. Importantly, this influenced the questions we were asking, not only the solutions we found. If we had stuck to the narrow perspective of a specific program analysis problem, especially the way of evaluating and demonstrating the work — e.g., only speed + precision, rather than killer demos — it’s doubtful that we would have obtained the results or the impact that we eventually did.

Being outward looking also makes research more fun, as well as being of more potential value to the world. The four of us are all working on new things presently. Examples: Cristiano’s work on local state in UI programming with ReasonReact, Dino’s work on temporal reasoning at scale (paper forthcoming), Peter’s Incorrectness Logic, a new foundation for bug catching, and Hongseok’s work on verification for machine learning. It’s a great time for PL research to impact the wider world, be it by interaction with industry or with other research disciplines: there’s lots of unexplored territory, new results to be obtained, and new scientific questions to be generated.

Bios

Cristiano Calcagno has held a faculty position at Imperial College and led the technical development of Infer at Monoidics and Facebook. He is currently a Software Engineer at Facebook Switzerland.

Dino Distefano received his PhD from the University of Twente. He was CEO of Monoidics and is currently an Engineering Manager at Facebook and Professor of Software Verification at Queen Mary University of London.

Peter O’Hearn is a Research Scientist at Facebook and a Professor of Computer Science at University College London. He received his PhD from Queen’s University, Canada, and has previously held academic and industrial positions at Syracuse University, Queen Mary University of London, and Monoidics.

Hongseok Yang received his PhD in 2001 from the University of Illinois at Urbana-Champaign, USA. He was a faculty member first at Queen Mary, University of London, UK (2006-2011) and then at the University of Oxford, UK (2011-2017), before taking his current professorship at KAIST, Korea in 2017.

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.