As much as we love to build and ship new features, the performance and reliability of our platform are always our top priority. We take downtime seriously. We know that if Figma is unavailable, your work is disrupted, so we want to detail the service incidents that occurred on Tuesday, January 21, and Wednesday, January 22, 2020.

As reported on our status page and Twitter, a number of our users experienced intermittent issues or were unable to load Figma between the hours of 6 and 7 AM PST on January 21st, and at multiple points during the day on January 22nd.

These incidents do not live up to your expectations or our standards, and we apologize for the disruption they caused. In this post, we’ll dive into our analysis of the issue, share technical details that illustrate why we’re confident in our diagnosis, and discuss next steps we’re taking to prevent future incidents.

January 21, 2020

[6:11 AM PST] Our engineering team receives automated alerts about elevated error rates.

[6:54 AM PST] We discover a long-running, expensive query that has been causing elevated database CPU. After canceling this query, performance returns to normal.

January 22, 2020

[9:45 AM PST] Database metrics show an increase in write IOPS and lock contention. However, in contrast to the previous day, database CPU is lower than normal.

[10:00 AM PST] Our engineering team receives an automated alert and begins to investigate.

[11:42 AM PST] Database load causes API requests to queue up to the point that the service becomes unavailable for some users. To relieve the database load, we cancel several nonessential queries and increase allocated IOPS. Performance temporarily stabilizes.

[2:00 PM PST] Database performance deteriorates and the site again becomes unavailable.

[3:00 PM PST] We restart the database to apply settings that temporarily disable a background process we believed was contributing to the write load. Performance stabilizes.

[7:00 PM PST] We take the service down for emergency maintenance and upgrade the database to a version we had already tested in our staging environment.

[8:15 PM PST] We complete the upgrade and bring the site back online. All site and database metrics return to normal.

Tuesday’s incident was primarily caused by a long-running, expensive query. To prevent this issue from recurring, we will be improving our monitoring of expensive queries and setting tighter bounds on allowed running time.
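
As one illustration of what a tighter bound can look like (a sketch, not our exact configuration, and the role name is hypothetical), PostgreSQL's statement_timeout setting cancels any statement that runs longer than the configured limit:

    -- Hypothetical application role; statement_timeout cancels any
    -- statement that runs longer than the configured limit.
    ALTER ROLE app_readonly SET statement_timeout = '30s';

    -- A looser bound can also be set per session for known-expensive batch work.
    SET statement_timeout = '5min';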

The root cause of Wednesday’s incident was that a routine change in database statistics caused PostgreSQL 9 to mis-plan the execution of one of our queries, which led to expensive table scans and writes to temporary buffers. This was exacerbated by concurrent aggressive autovacuuming operations, triggered by a vacuuming backlog from the long-running query that had been terminated on Tuesday.

The upgrade to PostgreSQL 11 mitigated both of these causes: improvements to the query planner eliminated the possibility of the bad query plan, and the performance characteristics of autovacuuming have been significantly improved between versions 9 and 11.

For a detailed technical explanation of our investigation, conclusions, and next steps, read on…

In the early stages of the incident, our leading theory was that the performance problems were primarily due to automatic vacuuming. Vacuuming is a routine process that helps maintain database health. Normally it runs automatically and with minimal impact on other database tasks, but in this case, the long-running query that had been canceled on Tuesday produced a backlog of data to be vacuumed. This crossed the threshold for a more aggressive form of automatic vacuuming intended to prevent transaction ID wraparound. This autovacuum mode has a greater impact on locking and writes, especially in the version of PostgreSQL we were running.
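
For readers who want to see how close a database is to this threshold, here is a rough sketch (not the exact queries we ran) that compares each table's oldest unfrozen transaction ID against the autovacuum_freeze_max_age setting that triggers the aggressive wraparound-prevention vacuum:

    -- Threshold at which PostgreSQL launches the aggressive,
    -- wraparound-prevention form of autovacuum.
    SHOW autovacuum_freeze_max_age;

    -- Ordinary tables whose oldest unfrozen transaction ID is closest
    -- to that threshold (a larger age means an aggressive vacuum is nearer).
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;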

During the incident, we canceled autovacuum statements running on our largest tables and saw database metrics improve temporarily. However, the stability was short-lived: eventually autovacuuming would resume, and write IOPS and lock contention would climb again. Fully preventing aggressive autovacuuming required changing the autovacuum_freeze_max_age database parameter and rebooting, which we eventually did. (This reboot was responsible for a portion of the downtime between 2 and 3 PM.) While this brought relative stability to the site, write IOPS, write latency, and lock contention were still higher than normal.
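
For the curious, this is roughly how autovacuum workers can be spotted and canceled; it is a sketch of the general technique rather than a transcript of our session, and it requires superuser-equivalent privileges:

    -- Autovacuum workers report what they are doing in pg_stat_activity.
    SELECT pid, query, xact_start
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%';

    -- Cancel them. They will eventually be relaunched, because
    -- autovacuum_freeze_max_age can only be changed at server start,
    -- which is why a reboot was needed.
    SELECT pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%';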

The next mitigation we discussed was upgrading the version of PostgreSQL we were using. We had two reasons to consider an upgrade:

PostgreSQL 9.6 and later include an important optimization related to autovacuuming.

With PostgreSQL 10 and later, Amazon RDS (the managed PostgreSQL service we use) offers more sophisticated performance analysis tools that will help us diagnose issues such as this.

We had previously upgraded our staging database to PostgreSQL 11, tested it for several months without issues, and written a detailed plan for performing the upgrade in production. We therefore felt confident performing the upgrade even under less-than-ideal conditions. Given the risk of ongoing instability, we determined that it was best to do the upgrade that evening rather than waiting until the weekend. Our preparation paid off: the upgrade was successful, and with stability restored, we could focus fully on diagnosing the root causes.

Because we continued to see performance issues even after disabling autovacuuming, we concluded that while it may have been a contributing factor, it was not the only culprit. We turned our attention to a second factor: an unexpectedly expensive query. During the incident, we had observed that one particular statement with complex subqueries repeatedly showed up as being involved in lock contention. We restored a database snapshot from immediately prior to the database upgrade and examined the query plan for the problematic statement. Several facts about the plan stood out:

The plan was much more expensive than expected. In particular, it estimated that the query would return over 20 million rows, when in fact the actual result was only 3 rows. This likely meant either that the query planner was operating on bad statistics about the contents of the tables involved in the query, or that we had encountered a bug or deficiency in the planner (the statistics possibility is the easier of the two to check; see the sketch after this list).

The expensive plan performed full table scans where we expected it to use an index, and it wrote a large amount of data to temporary buffers, consistent with the high write IOPS and temp bytes metrics we observed during the incident.
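
One way to test the bad-statistics hypothesis, sketched here with placeholder table names rather than our actual schema, is to check when the planner's statistics were last refreshed and, if they are stale, refresh them by hand:

    -- When were planner statistics last gathered for each table?
    SELECT relname, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY relname;

    -- Refresh statistics for a suspect table so the planner can re-estimate.
    -- ("documents" is a placeholder name.)
    ANALYZE documents;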

On the upgraded database, the plan for the same query predicted row counts accurately, used the expected indexes, and did not write to temporary buffers. This gave us some confidence that the stability we saw post-upgrade was not just a fluke and that upgrading had been the right call.
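
To show the kind of plan comparison this involved (the query and table names below are stand-ins, not the actual statement), EXPLAIN with the ANALYZE and BUFFERS options reports the planner's row estimates alongside the actual row counts and any temporary-buffer writes:

    -- Stand-in query with a subquery; the real statement was more complex.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT d.id
    FROM documents d
    WHERE d.team_id IN (SELECT team_id FROM memberships WHERE user_id = 42);

    -- In the pre-upgrade plan, the signatures described above show up as
    -- sequential scans where an index scan was expected, a large gap between
    -- the planner's estimated row counts and the actual row counts, and
    -- "temp read/written" entries on the Buffers lines. On the upgraded
    -- database, the same query produced accurate estimates and no temp writes.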