From MozillaWiki

Firefox Add-On Outage Technical Report: Analysis and Recommendations

Authors: Peter Saint-Andre, Matthew A. Miller

Last updated: 2019-07-02

Introduction

On May 3, 2019, the intermediate certificate used to sign all deployed add-ons expired. Although this expiration was not unexpected (in fact, plans were in place to deploy a replacement certificate, which had already been generated), the consequences were unforeseen: almost all deployed add-ons stopped working for millions of Firefox users. This report summarizes our findings regarding the technical aspects of this incident and provides recommendations for avoiding or mitigating such incidents in the future.

Root Causes

On the face of it, the root cause might seem simple: a signing certificate expired and we didn’t renew it in time. If we just fix our monitoring tools and pay attention in the future, this won’t happen again. Right?

Unfortunately, it’s not that simple.

Yes, a signing certificate expired. But expiration by itself would not necessarily cause Firefox to treat existing add-ons as invalid. Would it be good enough for the signature to be valid at the time the add-on was signed? Should we really need to re-sign all existing add-ons whenever we generate a new signing certificate (even if this were feasible), or would we use a new cert only for newly issued add-ons? How would we handle add-ons for long-lived releases like ESR? And so on.

A combination of factors was involved here: (1) the expiration of a signing cert and (2) runtime code that checks the validity dates of that signing cert during add-on validation. Thus the primary root cause was more a matter of differing assumptions than a failure to monitor or update. Specifically:

The server-side teams whose systems (e.g., Autograph) generate signatures knew the certificate was expiring, but did not see a problem because they had reason to believe that signing certificate dates were not checked by the client.

The client-side teams whose code performs add-on validation knew that dates were not checked for end-entity certificates (we modified this behavior in a previous outage), but might not have realized that dates for intermediate certificates were still checked when invoking the nsIX509CertDB function in the core SSL library.

The crypto teams who maintain that SSL library simply provide an API and aren’t responsible for how client-side code uses that API.

The testing teams who ensure the quality of Firefox weren’t aware of the need for test coverage of expiring intermediate certificates.
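The interaction between these assumptions can be made concrete with a small model. The following is a minimal sketch, not Firefox's actual validation code: the `Cert` type, the `chain_dates_ok` helper, and every date below are hypothetical and chosen only for illustration. It shows that skipping the date check for the end-entity certificate still fails validation once the intermediate has expired, while evaluating the chain as of the signing time would succeed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Cert:
    role: str             # "end-entity", "intermediate", or "root"
    not_before: datetime  # start of the validity window
    not_after: datetime   # end of the validity window


def utc(year, month, day):
    """Build a timezone-aware UTC datetime at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def chain_dates_ok(chain, at, skip_roles=frozenset()):
    """True if every cert whose role is not skipped is within its window at `at`."""
    return all(
        cert.not_before <= at <= cert.not_after
        for cert in chain
        if cert.role not in skip_roles
    )


# Illustrative dates only; these are NOT the real certificate values.
chain = [
    Cert("end-entity", utc(2018, 1, 1), utc(2019, 1, 1)),
    Cert("intermediate", utc(2017, 5, 3), utc(2019, 5, 3)),
    Cert("root", utc(2015, 1, 1), utc(2035, 1, 1)),
]
signed_at = utc(2018, 6, 1)     # hypothetical time the add-on was signed
after_expiry = utc(2019, 5, 4)  # a moment after the intermediate expired

# End-entity dates ignored, intermediate dates still checked: validation fails.
print(chain_dates_ok(chain, after_expiry, skip_roles={"end-entity"}))  # False

# Dates ignored for both end-entity and intermediate: validation passes.
print(chain_dates_ok(chain, after_expiry,
                     skip_roles={"end-entity", "intermediate"}))       # True

# Full date check, but evaluated as of the signing time: validation passes.
print(chain_dates_ok(chain, signed_at))                                # True
```

The three calls correspond to three different policies. Which one is appropriate is exactly the kind of question the teams had not settled with one another.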

These various teams did not cross-check their underlying assumptions and understandings, or engage in “what-if” scenario planning and future testing. Thus the broader Firefox team did not have a firm grasp on the functioning of a complex system with the potential to impact the entire Firefox user base.

Our conclusion is that this incident was not the fault of any individual or team, but was the result of having an interlocking set of complex systems that were not well understood across all the relevant teams.

Secondary Complications

The various teams mentioned above, and many others, responded quickly, professionally, and thoroughly once the scope and impact of the incident were realized. The effort involved was truly impressive.

However, several aspects of the response were hindered by secondary complications, and it is helpful to understand these in addition to the root causes.

First, we have a very large number of deployment targets, not all of which can be handled in the same way, and we had several techniques (hotfixes, dot releases, etc.) available for pushing changes out to different platforms and versions. Lack of understanding and documentation about these targets and techniques caused some delays in deployment of fixes.

Second, because the initial hotfix involved Normandy (which required users to voluntarily enable Telemetry if they had it turned off), we gathered more data than we ordinarily do, and lost legitimate data in the process of purging this over-gathered data. Additionally, other Normandy studies were temporarily disrupted, which could have affected their outcomes.

Third, lack of documentation and experience with emergency processes led to some confusion and delay in responding (e.g., responsibilities are not always well understood before incidents occur, and emergency contact procedures and methods were not well established).

Fourth, the lack of in-house or on-call QA resources caused delays in testing proposed fixes across various platforms because our external teams at Softvision were not immediately available through normal channels (in fact, engaging with individual Softvision team members could have introduced legal complications and the potential for data leakage).

Recommendations

Spurred by this incident, teams from many areas of Mozilla have already been thinking about opportunities for improvement in our systems, processes, documentation, and operations. At a high level, we suggest a focus on the following areas. This fuller remediation does have a deadline: the newly created intermediate certificate that resolved this incident is set to expire in 2025.

First and most fundamentally, the full Firefox team does not have a common understanding of the role, function, and operation of cryptographic signatures for Firefox add-ons. For instance, although there are several good reasons for signing add-ons (monitoring add-ons not hosted on AMO, blocklisting malicious add-ons, providing cryptographic assurance by chaining add-ons to the Mozilla root), there is no shared consensus on the fundamental rationale for doing so. In addition, maintaining a full public key infrastructure (PKI) is a complex task and we do not necessarily have a firm grasp of the engineering and business tradeoffs involved. More complete documentation of the overall system (and the role of each sub-system therein) is critically important, as is communication about that architecture to all relevant teams and training of new team members. This documentation should include accurate, up-to-date information about all relevant inputs, outputs, dependencies, APIs, protocols, formats (e.g., subject and issuer identities), tooling, monitoring, operations, and responsible teams and/or individuals.

Second, if we are committed to the current PKI and add-on signing approach, we should make improvements to our certificate management processes, especially our key rollover strategies. This will involve clear procedures for handling revoked and expiring intermediate and root certificates, including the impact on existing add-ons, updated add-ons, and new add-ons across the full range of add-on types (web extensions, themes, language packs, etc.) and release channels (both current and legacy). Expectations need to be set and communicated for all aspects of key rollover, including but not limited to management of the offline hardware security module (HSM), interactions with the cloud HSM, monitoring of expiration times for the full set of deployed certificates, client-side validation of the full certificate chain (end-entity, intermediate, and root), acceptable expiry ranges, and extensive testing and future-proofing (such as setting system clocks to future times on selected test rigs).
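As one illustration of expiration monitoring combined with future-clock testing, the following is a minimal sketch under stated assumptions: the `expiring_within` helper, the inventory names, and the exact dates are all hypothetical (this report says only that the new intermediate expires in 2025). The check takes the current time as a parameter, so the same code can be run against a simulated future date without changing a system clock.

```python
from datetime import datetime, timedelta, timezone


def expiring_within(inventory, now, horizon):
    """Return names of certs whose not_after is at or before now + horizon.

    Already-expired certificates are included. Results are ordered by
    expiry time, soonest first.
    """
    deadline = now + horizon
    due = [
        (not_after, name)
        for name, not_after in inventory.items()
        if not_after <= deadline
    ]
    return [name for _, name in sorted(due)]


# Hypothetical inventory: names and exact dates are placeholders.
inventory = {
    "addons-intermediate": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "addons-root": datetime(2035, 1, 1, tzinfo=timezone.utc),
}

# Checked as of this report's last update, nothing is due within a year.
today = datetime(2019, 7, 2, tzinfo=timezone.utc)
print(expiring_within(inventory, today, timedelta(days=365)))   # []

# Checked against a simulated future clock, the intermediate is flagged.
future = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(expiring_within(inventory, future, timedelta(days=365)))  # ['addons-intermediate']
```

Passing the clock in explicitly is a design choice that keeps the monitoring logic testable; the "acceptable expiry range" maps onto the `horizon` argument.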

Third, we need to rationalize and improve our code delivery mechanisms. This involves several things. For instance, we need a much better, clearly communicated “map” of the full range of our deployment targets: not just current desktop targets (Nightly, Beta, Dev Edition, Release, ESR) but also mobile and mixed reality, as well as legacy versions that we judge important enough to update even if they are not officially supported. We need a clear understanding of hotfix techniques (e.g., Balrog vs. Normandy or a more special-purpose tool) and dot-release requirements across those deployment targets. We need to decouple our update mechanisms from our data-gathering mechanisms. And we need to set clear expectations internally, with end users, and with partners.