Following our May 10 deploy, the Archive experienced a number of issues stemming primarily from increased load during the Elasticsearch upgrade process.

As we noted in our March downtime post, the Archive hasn't been running at full strength due to this upgrade. Compounding the issue, it has taken significantly longer than planned to get the new code deployed to production, and we are now entering one of the more active times of the year. (Our daily page views for a Sunday -- our busiest day -- are over 29 million, and the normal load on our database servers is over a million queries per minute.)

You can find more details on the current state of the Archive's servers below, along with a rough timeline of the issues we experienced between Thursday, May 10, and Monday, May 14. However, the main takeaway is these issues are likely to continue until the Elasticsearch upgrade is completed and our network capacity is increased. We're very grateful for the support and patience you've shown, and we look forward to finishing our upgrades so we can provide you with a stable Archive once more.

Background: Server state

We normally have five Elasticsearch servers, but late last year we turned one of our front end machines into an Elasticsearch server, allowing us to divide these six machines into two groups: one three-machine cluster for the production site, and another for testing the upgraded code.

Having only three Elasticsearch servers meant the site experienced significant issues, so on April 11, we reprovisioned one of our old database servers, which had been producing web pages, as an Elasticsearch server in the production cluster.

In addition to the ongoing Elasticsearch upgrade, our Systems team recently completed a major overhaul intended to help with our long term stability and sustainability. Between November 2017 and March 2018, they reinstalled all the application servers, web front ends, and new Elasticsearch systems with a new version of the Debian (Stretch) operating system using FAI and Ansible. This meant rewriting the configuration from the ground up, since we had previously used FAI and CFEngine. They also upgraded various other packages during this process, and now all that's left to upgrade for the Archive are the database servers.

Timeline

May 10

16:25 UTC: We deploy the code update that will allow us to run the old and new Elasticsearch code simultaneously. (We know the new version still has a few kinks, and we expect to find more, so we're using a Redis-based system called rollout to make sure internal volunteers get the new code while everyone else gets the old version.) Because this is our first deploy since the application servers have been reinstalled, the deploy has to be done by hand.

16:56 UTC: We turn on the new Elasticsearch indexing.

21:03 UTC: We notice -- and fix -- some issues with site skins that resulted from doing a manual deploy.

May 11

05:00 UTC: We see large amounts of traffic on ao3-db06, which is both the Redis server we use for Resque and the MySQL server responsible for writes. We mistakenly believe the traffic is caused by the number of calls to rollout to check if users should see the new filters.

05:36 UTC: We increase the number of Resque workers.

10:06 UTC: The Resque queue is still high, so we increase the number of workers again.

21:00 UTC: We no longer believe the increased traffic is due to rollout, so we turn the new indexing off and schedule 45 minutes of downtime for 06:15 UTC the following morning.

May 12

06:15 UTC: In order to mitigate the extra traffic, we move Redis onto a second network interface on ao3-db01. However, routing means the replies return on the first interface, so it is still overwhelmed.

06:42 UTC: We extend the downtime by 30 minutes so we can change the new interface to a different network, but replies still return on the wrong interface.

07:26 UTC: Since we've used up our downtime window, we roll the change back.

After that, we spend large parts of the day trying to figure out what caused the increase traffic on ao3-db06. With the help of packet dumps and Redis monitoring, we learn that indexing bookmarks on new Elasticsearch is producing a large number of error messages which are stored in Redis and overwhelming the network interface.

May 13

Our coders spend most of Sunday trying to determine the cause of the Elasticsearch errors. We look at logs and try a number of solutions until we conclude that Elasticsearch doesn’t appear to support a particular code shortcut when under load, although it's not clear from the documentation why that would be.

20:45 UTC: We change the code to avoid using this shortcut and confirm that it solves the issue, but we do not resume the indexing process.

23:45 UTC: The Resque Redis instance on ao3-db06 freezes, likely due to load. As a result, some users run into errors when trying to leave comments, post works, or submit other forms.

May 14

06:30 UTC: We restart Redis, resolving the form submission errors. However, we begin to receive reports of two other issues: downloads not working and new works and bookmarks not appearing on tag pages.

16:25 UTC: To help with the download issues, we re-save our admin settings, ensuring the correct settings would be in the cache.

16:34 UTC: Now we look into why works and bookmarks aren't appearing. Investigating the state of the system, we discover a huge InnoDB history length (16 million rather than our more normal 2,000-5,000) on ao3-db06 (our write-related MySQL server). We kill old sleeping connections and the queue returns to normal. The server also returns to normal once the resultant IO has completed.

16:55 UTC: Bookmarks and works are still refusing to appear, so we clear Memcached in case caching is to blame. (It's always -- or at least frequently -- caching!)

17:32 UTC: It is not caching. We conclude Elasticsearch indexing is to blame and start reindexing bookmarks created in the last 21 hours.

17:43 UTC: New bookmarks still aren't being added to tag listings.

17:54 UTC: We notice a large number of Resque workers have died and not been restarted, indicating an issue in this area.

18:03 UTC: We apply the patch that prevents the bookmark indexing errors that previously overwhelmed ao3-db06 and then restart all the unicorns and Resque workers.

18:43 UTC: Once everything is restarted, new bookmarks and old works begin appearing on the tag pages as expected.

19:05 UTC: The site goes down. We investigate and determine the downtime is related to the number of reindexing workers we restarted. Because we believed we had hotfixed the issue with the reindexing code, we started more reindexing workers than usual to help with the indexing process. However, when we started reindexing, we went above 80% of our 1 Gbit/sec of ethernet to our two MySQL read systems (ao3-db01 and ao3-db05).

19:58 UTC: After rebalancing the traffic over the two read MySQL instances and clearing the queues on the front end, the indexers have stopped, the long queues for pages have dissipated, and the site is back.

Takeaways

We will either need multiple bonded ethernet or 10 Gbit/sec ethernet in the very near future. While we were already expecting to purchase 10 Gbit networking in September, this purchase may need to happen sooner.

Although it has not been budgeted for, we should consider moving Redis on to a separate new dedicated server.

While we are running with reduced capacity in our Elasticsearch cluster and near the capacity of our networking, the reliability of the Archive will be adversely affected.