Hi Everyone,

We want to officially apologize for all the server problems today.



We understand that it is beyond frustrating to be repeatedly kicked from the servers, and not be able to get back online.



I want to give you some insight into why this happened, what happened, and what we're doing to solve this problem going forward.



THE WHY

In the broadest terms, the servers were hacked back on Feb 14. As a consequence of this, we lost all the server code that handled configuration of the servers, authentication, etc. We rewrote all of this from scratch using a much newer and better tech stack than we had before.



However, there are always bugs with something new, and sometimes it takes months before these bugs reveal themselves. These are bugs both in the code and in our processes. Once we diagnose them, we can fix them so they won't happen again.



THE WHAT

There were three problems that happened over the last two days:

1. Some servers became low on memory. This is a simple balancing problem to make sure that the load is evenly distributed across all the machines. We made adjustments on Thursday and Friday to evenly distribute the load by moving the zones around. Unfortunately, I personally made an error when doing this, which leads me to the second error:

2. There is a requirement that there is at least one zone active on each of the machines. However, when I rebalanced, I left some machines unpopulated for some regions (US1, US2, etc.). This caused the problem with some zones failing late Friday night.

3. We created a new script to automate rebooting the machines. The problem is that this script, since it was automated, ran much faster than when we did this manually. This created a situation where some machines were rebooting before others had completely shut down and freed up all their resources. That left too few resources available, so some machines failed to reboot properly. This is what programmers call a "Race Condition": the result depends on timing, and here something happened earlier than it was supposed to.
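For the technically curious, the requirement in problem 2 can be written as a small check that runs after every rebalance. This is only a rough sketch, not our actual server code: the layout (region, then machine, then the zones it hosts) and all of the names such as `find_unpopulated`, `machine-a`, and `zone-1` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: region name -> machine name -> zones hosted there.
Placement = Dict[str, Dict[str, List[str]]]


def find_unpopulated(placement: Placement) -> List[Tuple[str, str]]:
    """Return every (region, machine) pair that hosts no zones at all."""
    return [
        (region, machine)
        for region, machines in sorted(placement.items())
        for machine, zones in sorted(machines.items())
        if not zones  # an empty list means the machine was left unpopulated
    ]


# Made-up example: machine-b was left empty in US1 after a rebalance.
placement = {
    "US1": {"machine-a": ["zone-1", "zone-2"], "machine-b": []},
    "US2": {"machine-a": ["zone-3"], "machine-b": ["zone-4"]},
}

problems = find_unpopulated(placement)
for region, machine in problems:
    print(f"WARNING: {machine} has no active zone in {region}")
```

A check like this would have flagged the empty machines before the new layout went live, instead of hours later when the zones started failing.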



THE SOLUTIONS

We fixed #1: the zones are now well-balanced across all the machines. We fixed #2 by making sure all the machines are populated. Now that we are very aware of this requirement, we won't make this error again. Additionally, we are going to add some warnings, or otherwise prevent this requirement from causing a bug in the first place. And for #3, we are now rebooting the servers manually to avoid the race condition. Later we will rewrite the script to add a simple delay after shutting down all machines and before restarting them all.
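To give an idea of what the fix for #3 might look like, here is a rough sketch. Note that it swaps the simple fixed delay for a loop that keeps checking until every machine reports that it has stopped, which is one common way to close this kind of race. The `stop`, `start`, and `is_stopped` callbacks are hypothetical stand-ins for whatever actually controls the machines, and the timeout values are placeholders, not our real settings.

```python
import time
from typing import Callable, Iterable


def wait_until_all_stopped(
    machines: Iterable[str],
    is_stopped: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every machine reports stopped; False if the timeout passes."""
    deadline = clock() + timeout_s
    pending = set(machines)
    while pending:
        # Keep only the machines that are still shutting down.
        pending = {m for m in pending if not is_stopped(m)}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


def reboot_all(
    machines: Iterable[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    is_stopped: Callable[[str], bool],
    **wait_kwargs,
) -> None:
    """Stop everything, wait for the shutdowns to finish, then start everything."""
    machines = list(machines)
    for m in machines:
        stop(m)
    # The barrier: nothing restarts until every machine has fully stopped.
    if not wait_until_all_stopped(machines, is_stopped, **wait_kwargs):
        raise TimeoutError("some machines did not shut down in time")
    for m in machines:
        start(m)
```

The important part is the wait between the two loops. A fixed delay does the same job as long as it is long enough, but checking each machine means the script neither restarts too early nor waits longer than it has to.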



MOVING FORWARD

It's likely that there are still some problems with the new server setup. However, we have fixed the three problems mentioned above (in addition to several we fixed in the preceding weeks). We will continue to fix and improve as we go, and although I can't promise that we will never have server problems again, I am confident that the frequency will continue to decrease until, hopefully, it is extremely rare for problems to happen.



Again, sorry for the problems, and I hope you have a nice weekend,

Damon



p.s. I didn't want to post the following analogy in the post above since it's kind of a side note, but it's still interesting. I was a pilot for several years. The FAA has an excellent program in place to increase the reliability of aviation over time. Every time some kind of problem is discovered with a specific model of an aircraft, the FAA will diagnose it, and then they will issue an "airworthiness directive". Example: some specific aircraft is discovered to have a structural weakness in the rudder, so the FAA will order all aircraft of that type to receive a reinforcing plate. The owners must then immediately take their aircraft to a mechanic to get this work done. Over time, it is actually the oldest models of aircraft that are, on average, the safest. All of their problems have been discovered and fixed. It's the newest models that are untested, and have the potential to have undiscovered problems.



There's a similar saying about pilots: the safest pilot is the one who has made every mistake once.