Last week, I had the pleasure of organizing the POPLmark 15 Year Retrospective Panel, which looked back on the POPLmark Challenge 15 years later. The POPLmark Challenge was a problem set and benchmark suite that rallied researchers to understand and push the limits of the technologies that made it possible to write mechanized (machine-checked) proofs about programming languages. The hope was to make it feasible for the programming languages (PL) community to shelve tedious and error-prone handwritten proofs about programming languages in favor of those mechanized proofs.

I was not involved in the original challenge, but I grew interested in it while writing a survey paper about mechanized proofs. The challenge had garnered hundreds of citations—so clearly it was influential—but after talking to those involved, I still couldn’t answer what I’d thought was a simple question:

Was the POPLmark Challenge successful?

Today I’d say yes: The POPLmark Challenge helped stir lasting excitement about mechanized proofs within the PL community. Beyond that, it intentionally brought the PL community together with those developing technologies for writing mechanized proofs.

Despite the advances and successes, there is a lot more to do. We still don’t really understand the limits of these technologies, and we still have much to learn about how to use them best. And we have focused somewhat excessively on certain challenges and technologies, sometimes to the exclusion of others.

In this post, I reflect on the state of affairs of mechanized proof in PL, organized around the topics and discussions that arose in the panel.

Looking Back to Look Forward

To benefit from the POPLmark Challenge, we need to take a step back and look at exactly what happened. Then we need to apply what we learn to the problems that matter today. The purpose of the panel was to amplify this process.

The panel featured a mix of POPLmark authors and participants, as well as experts on the problems in mechanized proofs about programming languages that resonate today. About 90 people showed up to watch and ask questions.

Moving forward, here is just a sample of what we can learn (with links to the relevant spots in the panel video).

POPLmark sparked excitement about mechanized proofs and helped bring two communities together.

In his opening talk, Benjamin Pierce sketched out the world in 2004: Two islands, one with PL and systems people, and one with formal verification people. There were many people standing on each island, but only a few people swimming between them.

This came as a surprise to many of us junior researchers in the audience, since we had only ever known a world with boats and bridges between those islands.

The numbers we have for POPL tell some of this story: The proportion of papers accepted to POPL that we found with partial or complete mechanizations was 16 out of 66 (24%) in 2018, 19 out of 76 (25%) in 2019, and 17 out of 68 (25%) in 2020. These proportions are up a bit from 7 out of 36 (19%) in 2009, and 10 out of 51 (20%) in 2014. Conferences for mechanized proofs like CPP (colocated with POPL) have also grown, and it’s not unusual to see mechanized proofs in systems and security conferences, too. Almost everyone who attended the panel had used a proof assistant (a tool for writing and checking mechanized proofs) at some point.
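For readers who want to check the arithmetic, here is a minimal sketch that recomputes the percentages from the counts quoted above. The variable and function names are my own, and rounding to the nearest whole percent is an assumption about how the figures were derived.

```python
# (year, papers with partial or complete mechanizations, accepted papers),
# copied from the counts quoted in the text above.
counts = [
    (2009, 7, 36),
    (2014, 10, 51),
    (2018, 16, 66),
    (2019, 19, 76),
    (2020, 17, 68),
]


def percent(mechanized, accepted):
    """Share of accepted papers with a mechanization, as a whole percent."""
    return round(100 * mechanized / accepted)


# Map each year to its rounded percentage.
shares = {year: percent(m, n) for year, m, n in counts}

for year, share in shares.items():
    print(f"POPL {year}: {share}%")
```

Under that rounding assumption, the computed shares match the percentages in the text, and the increase from 2009 to 2020 comes to roughly five or six percentage points.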

Sparking excitement about mechanized proofs and bringing these two communities closer were some of the stated goals of POPLmark, and they have largely come to fruition. Of course, it’s hard to know how much of this was thanks to POPLmark: some of those people already swimming between the two communities undoubtedly helped! For example, in 1987, well before POPLmark, Bob Harper, Furio Honsell, and Gordon Plotkin introduced LF, a framework for defining logical systems. LF played a central role in the history of mechanized proofs for PL. The LF implementation Twelf by Frank Pfenning and Carsten Schürmann made it possible to express the kinds of PL proofs that were common at the time. Using Twelf, Michael Ashley-Rollman, Karl Crary, and Bob Harper submitted the first solution to the POPLmark Challenge. Karl Crary and Bob Harper later used Twelf to write a full mechanized proof of the type safety of the language Standard ML.

Beyond LF, large proof engineering efforts like panelist Xavier Leroy’s work on CompCert helped convince the community that mechanized proofs at scale were possible. Many more of these large proof developments followed, including panelist Scott Owens’ and others’ work on CakeML.

But anecdotes from the panelists do suggest that POPLmark played a role. For example, even though CompCert was underway before POPLmark, Xavier Leroy found POPLmark inspiring. Benjamin Pierce noted that the interest around and results of POPLmark were central motivations for writing Software Foundations, his widely used series of books on programming languages with mechanized proofs in Coq.

It’s not unusual for benchmarks to have this effect. So, if there are other communities we’d like to bring together (mechanized proofs and testing, perhaps), or technologies we’d like to see more excitement about, a challenge and benchmark suite may be a good way to accomplish this. Important things to consider are choosing a problem of the appropriate size and difficulty, and carefully considering the timing of the challenge and benchmark suite.

The excitement POPLmark spurred was perhaps too narrow.

POPLmark deliberately carved out a challenge that was timely and small enough not to deter participation. The PL community in turn responded with 15 years of work that was disproportionately focused on one important challenge of mechanized proofs about programming languages, perhaps to the exclusion of other important challenges. For example, even though POPLmark called for the community to connect their proofs to language implementations, there was little emphasis on this from most of the community. The community also mostly flocked to a particular tool for writing these proofs: Coq. Perhaps this came at the expense of other excellent tools like Isabelle, HOL, and Twelf.

When we design benchmark suites and challenges, we should consider and try to prevent overly narrow focus (of course, this may come at the cost of choosing a problem of the appropriate size). For example, we should make sure that there is adequate representation of different technologies among those who design and evaluate benchmark suites and challenges about those technologies. And as researchers and as educators, we should broaden our focus. Focusing more on connecting our proofs to real languages could mean not just more confidence in our languages, but also more confidence in our papers, and better communication of our ideas. Broadening our dialogue to capture different tools could mean more productive use of the diverse tools that are available to us that have continued to evolve independently of POPLmark.

It’s time to address the problems that matter today.

Connecting our proofs to real languages is just one problem we still need to think about. When it comes to reasoning about programming languages, we could use more progress on dealing with state and concurrency. When it comes to mechanized proofs more broadly, we could always use more work on proof reuse, interoperability, and the challenges of scale.

Many people are currently working on these challenges, and it’s fantastic that the community is focused on so many of the problems that matter today. These problems are especially important when it comes to writing mechanized proofs of large and complex systems. Both industrial and academic users are under time and financial constraints that make reuse and productivity particularly salient. For large and complex systems, it may make sense to have strong guarantees about only some parts of those systems, making mixed methods with good interoperability a tempting choice. We should continue down these paths.

And if we want to continue down these paths, maybe now is the right time for a benchmark suite or challenge in one of those areas—as long as we take the design considerations of scope, size, difficulty, timing, and community response that we’ve learned from POPLmark into account.

There’s a lot we still don’t know about how to evaluate success.

Of course, designing a good benchmark suite or challenge means evaluating success, and that’s difficult. The job of POPLmark was, in a sense, a bit easier than the job that’s ahead of us: it showed that mechanized proofs about programming languages were possible and gave us examples of how to use different techniques, but it did little to measure the relative success of those techniques. Measuring success is especially difficult when it comes to technologies that address the challenges of proofs about large-scale systems. It’s not clear how we can replicate the challenges of scale in a controlled and confined environment, though perhaps we can design a challenge around, for example, proofs that are robust in the face of change.

In some ways, this is symptomatic of the fact that we still have a lot to learn about evaluating programming languages and the tools and techniques around them more generally. Considering POPLmark in the broader scope of PL research, perhaps the biggest takeaway from my end is that it’s time to tackle this problem.

Bio: Talia Ringer is a Ph.D. student in the Paul G. Allen School of Computer Science & Engineering at the University of Washington in the PLSE lab, specializing in proof engineering. She is an NSF GRFP fellow and a contributor to the Coq interactive theorem prover, and is active in advising, mentorship, service, and outreach.

Acknowledgments: Thanks to Tej Chajed for the numbers about 2020 POPL artifacts. Thanks to the moderator Mike Hicks and the panelists Benjamin Pierce, Peter Sewell, Xavier Leroy, Robby Findler, Brigitte Pientka, and Scott Owens for the wonderful comments throughout the panel. Thanks to the other organizers Benjamin Pierce, Peter Sewell, Dan Grossman, and Stephanie Weirich and the moderator Mike Hicks for all of the help along the way and for feedback on this post. Thanks to the audience for all of the wonderful questions. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

Edit: Thanks to Robert Harper for pointing out that the CMU team proved the type safety, not the correctness, of Standard ML.

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.