Designing for failure


Nobody starts a free-software project hoping that it will fail, so it is a rare project indeed that plans for its eventual demise. But not all projects succeed, and a project that doesn't plan for failure risks doing its users harm. Dan Callahan joined Mozilla to work on the Persona authentication project, and he was there for its recent shutdown. At the 2017 linux.conf.au, he used his keynote slot to talk about the lessons that have been learned about designing a project for failure.

Mozilla is a non-profit organization dedicated to the open Internet. It "does lots of stuff", including the Firefox browser. Firefox helps to protect the net as an open resource in a number of ways, including giving Mozilla a place at the table in settings where the design of the web is under discussion. The web, he said, is too great to leave in the hands of corporations.

Callahan joined Mozilla to work on the Persona project, which sought to simplify and decentralize the process by which people log into web sites. Using Persona, users would go to a site and enter their email address there; they would then be sent to an authentication page under the email-address domain. If they authenticated successfully, they would get a certificate attesting to their identity, which could be used to log into multiple sites. The design was meant to be fully decentralized, with no big sites, not even Persona, in charge of authentication.

Authentication matters, he said, and improving it was a worthy goal. It is worth thinking back to where we were five years ago; the news was full of web-site break-ins and loss of passwords. There was little that users could do to ensure their safety beyond following good password hygiene, and few of them do that. Securing password-based authentication is not a solvable problem.

In response to this problem, sites were replacing passwords with "social login" options whereby users would log in via another provider. This mechanism deprives users of the ability to choose their identity; it "diminished the soul of the net." Social login imposes a third party between a site and its users, and subjects those users to that party's terms of service. For example, Facebook's "real name" policy has tripped up many users. In such a world, there can be no anonymous whistleblowers, no pseudonyms. It represents the loss of a fundamental human right. We cannot, he said, build a free platform without giving people the ability to choose how they identify themselves. Persona allowed users to use any identification they wanted, and it showed that decentralized authentication is possible, but it failed to change the web.

Callahan is a cave diver, meaning he finds underwater holes and swims as far into them as he can. It is a dangerous endeavor, requiring a lot of equipment and training. Cave divers have developed a number of techniques for dealing with failures, and every dive explicitly tests failure recovery in some way. Years ago, Sheck Exley looked at all known deaths from cave diving in an effort to find the general causes of failure; he came out with five rules:

Do not exceed your training.
Maintain a guide line to open water at all times.
Reserve 2/3 of your gas for the exit.
Do not go beyond the maximum depth of your gas mixture.
Carry three lights.

At the time this was written, following those rules would have prevented all known cave-diving deaths. The free-software community, he said, can learn from what cave divers do, and should come up with its own rules.

Three weeks ago, the Persona servers went read-only, with no further changes allowed; eventually they will fall off the net entirely. We need ways to examine failures like this. If you have a failing project, he said, you should share what is going on so that the community can avoid repeating mistakes.

Lessons learned

The first lesson to be learned in this case is that a free license is not enough to ensure a project's success.

There was a design failure in that the protocol still had a point of centralization. The email provider site doing authentication could not talk directly with the web site; instead it had to go through a relay. The goal was to eventually build the relay into the browser itself, but the Persona project did not plan for a loss of development resources before native browser support was implemented. That meant that anybody wanting to fork the project would have to fork the relay as well — a relay whose location was wired into the sites using Persona. This is a problem that could have been solved, but they were blinded by the context in which they were working and didn't see it.

Bits rot more quickly online, he said. If the LibreOffice project were to go away, we would still have working applications on our systems and could still access our documents. But what happens if WordPress suffers some unfortunate fate? All of those WordPress-based sites would not last long. We need to do better at writing software that can run in a stable mode without requiring people with high skills. He doesn't know how to do that; that's why he was giving a keynote, he said: he gets to present problems for others to solve.

"Complexity limits agency" was another one of the lessons. A project with a lot of moving parts requires a lot of skills just to set it up. People with such skills tend to be in high demand and not generally available; that is not a situation that empowers people. A free license, he said, does not further freedom for people who cannot run the software.

There were a number of little mistakes. The Persona user interface would put up a popup window for the authentication, with the idea that the context of the underlying page would be preserved. But a lot of users reflexively close popups without even looking at them; then they wonder why Persona isn't working. The project built a system that didn't mesh with user heuristics.

Mistakes in the API design led to lots of bugs; that didn't help either.

The project was not measuring the right things, he said; "we did not know who we really were". Was Persona a development project, or was it network infrastructure? It was staffed and developed like a project, and measured its success by the number of users it had. If, instead, Persona had seen itself as infrastructure, its developers would have asked different questions: was that infrastructure solving a real problem? This disconnect led to the wrong design decisions and a certain amount of "we will solve the web" hubris.

A project should explicitly define and communicate its scope, drawing clear boundaries between what the project is and what it is not. Did Persona verify email addresses, or did it solve the identification problem? The way the project's scope was defined, web sites almost had to be subservient to Persona.

While the Persona project was going on, Mozilla was also trying to start a new mobile phone. Phones need authentication too, so it was deemed that Persona could fit into that role. It is true that it could fit, he said, if one applied a great deal of force. But, in truth, it was the wrong tool for the job and did not fit well.

Projects should ruthlessly oppose complexity. Persona suffered from an explosion of options and dependencies, resulting in complex code that made everything harder. Among other things, that makes it harder for new contributors to join the project. In this case, there were only one or two outside contributors who did any significant work; when Mozilla stepped away from the project, nobody else was there to pick it up. Developers on a project should be able to say immediately if their system behaves as they think it should. "Focus and simplify."

Planning for failure

Persona made its share of mistakes. But, even when everything has been done right, projects can fail for any of a number of reasons. Thus, developers should be planning for failure from the beginning.

If you know your project is dead, Callahan said, you should say so. Persona took three years to go from the removal of staff to the unplugging of the servers. Mozilla tried to maintain the system without development resources, taking some 20 months to say that things were not working. There is a natural fear that admitting death is a self-fulfilling action; one always hopes that the project will come back to life. But admitting that this will not happen lets people prepare a replacement.

A project should ensure that users can recover without its involvement. Failure quickly leads to a demoralized, burned-out state; it is really hard for developers to do recovery work at that point. One of the things Persona did right was to use email addresses for identification; that allowed sites using Persona to send a password-reset email to affected users. The data needed to recover from Persona's failure was available outside of Persona itself. In general, projects should use standard data formats, and have users store their own data.
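The recovery path described above can be illustrated with a minimal sketch. It is not Persona's or any site's actual code: the function name, the ResetNotice type, and the message wording are all hypothetical. It assumes only that a site has kept its users' email addresses, and it shows how those addresses alone are enough to prepare password-reset notices once the external login service is gone.

```python
import secrets
from dataclasses import dataclass


@dataclass
class ResetNotice:
    """Hypothetical record of one reset message to be sent."""
    email: str
    token: str
    body: str


def build_reset_notices(user_emails, site_name):
    """Create one password-reset notice per distinct email address.

    Assumption: addresses are compared case-insensitively, which is a
    simplification made for this sketch.
    """
    notices = []
    seen = set()
    for raw in user_emails:
        email = raw.strip().lower()
        if not email or email in seen:
            continue  # skip blanks and duplicates
        seen.add(email)
        # One-time, unguessable token generated by the site itself;
        # nothing here depends on the departed login service.
        token = secrets.token_urlsafe(32)
        body = (
            f"{site_name} no longer uses its previous login service. "
            f"Use this one-time code to set a password: {token}"
        )
        notices.append(ResetNotice(email, token, body))
    return notices
```

Actually delivering the messages and storing the tokens is left out; the point is that everything needed for recovery is already in the site's own hands.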

To conclude, projects should seek to minimize the harm that results if and when they go away. Like a diver with three flashlights, a user of a failing but well-planned project can switch to another. We have to talk about our failures, he said, because the alternative is to continue repeating the same mistakes.

[Your editor would like to thank linux.conf.au and the Linux Foundation for assisting with his travel to the event.]

