At 7:40pm NZT today, all Path of Exile service was disrupted. While downtime does happen from time to time, this level of disruption is not acceptable.



At 7:45pm NZT, our sysadmin Thomas was alerted by our automated monitoring tools that the realm was experiencing issues. Unfortunately, he was away from a computer at the time and was unable to respond immediately. Normally, Thomas and I are both on call for server incidents so that one of us will always be available. Unfortunately, I am currently in the US at the Game Developers Conference and thus did not get server notifications.



At 8:00pm NZT, we were notified by our support team that the realm was experiencing issues and that Thomas had not responded. I began the process of attempting to diagnose the issue. According to the logs, all game instances on every server were crashing on startup.



At 8:11pm NZT, I made the call to attempt a realm restart to see if it would resolve the problem. This restart completed at 8:15pm NZT, but did not resolve the issue. I continued to investigate.



At 8:23pm NZT, I finally discovered that the problem was a malformed spam list update that had been pushed to production. A mistake was made in the formatting of the file and this crashed the server when the file was loaded.



By 8:27pm NZT, we had made the needed changes and begun pushing them to production.



At 8:30pm NZT, the update had been pushed and the problem was resolved.



While this incident was caused by human error in formatting a file, it should not have been able to occur in the first place, and our response should not have taken as long as it did. I will outline the steps that we will take to prevent an incident like this from occurring again.



The first and most obvious problem is that our response time on this incident was too slow. This is because I was away and Thomas happened to be unavailable. When I was able to respond, I was doing so on a hotel internet connection in the US tunnelling through our office in NZ and then back to the US. Not the best situation for debugging problems. Fortunately, we have already recently hired a new system administrator who will be joining us in a few weeks. This will mean that we can always have at least two server admins on call at all times, even when I happen to be travelling.



The second problem is that a file was put into production without first being test loaded. This is a process problem that should not have occurred. Normally, any file that goes on production is test loaded to verify that it doesn't cause any issues. Updating the spam list is a feature that was implemented outside our normal asset testing pipeline and so didn't receive the rigorous testing that we normally do on assets. From now on, spam list updates will be test loaded before being deployed.
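
To illustrate what a test-load step of this kind could look like, here is a minimal sketch. It assumes a hypothetical format of one regular expression per line, with blank lines and `#` comments allowed; the real spam list format and loader are not described in this post, and the function name `validate_spam_list` is made up for the example.

```python
import re


def validate_spam_list(text):
    """Test load a spam list and report problems instead of raising.

    Assumed (hypothetical) format: one regular expression per line.
    Returns a list of (line_number, message) tuples; an empty list
    means every entry loaded cleanly.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank lines and comments are allowed in this assumed format
        try:
            re.compile(entry)
        except re.error as exc:
            problems.append((number, f"invalid pattern: {exc}"))
    return problems
```

A deployment script could call this on the candidate file and refuse to push it to production if the returned list is not empty.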



The third fix is for the actual crash in the loader for the spam list that caused the problem.
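
One common way to make a loader like this safe is to build the new list completely before swapping it in, and to keep the previous list if anything fails to parse. The sketch below shows that idea only; it is not our actual loader, it again assumes a hypothetical one-regular-expression-per-line format, and the `SpamList` class and its methods are invented for the example.

```python
import re


class SpamList:
    """Holds the active spam patterns (hypothetical one-regex-per-line format)."""

    def __init__(self):
        self.patterns = []

    def reload(self, text):
        """Try to replace the active patterns with those in `text`.

        Returns True on success. On a malformed entry it returns False
        and leaves the previously loaded patterns untouched.
        """
        try:
            new_patterns = [
                re.compile(line.strip())
                for line in text.splitlines()
                if line.strip()
            ]
        except re.error:
            return False  # reject the update, keep the last good list
        self.patterns = new_patterns
        return True

    def is_spam(self, message):
        """Return True if any active pattern matches the message."""
        return any(p.search(message) for p in self.patterns)
```

With this shape, a bad update is rejected and can be logged, while the server keeps running on the last list that loaded correctly.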



I am very disappointed in myself that this incident was allowed to occur. We didn't have the staff required to cover for my absence, and I would like to apologise to all our players for the inconvenience caused. This is not the level of service you should expect and the process changes we are making will prevent an incident like this reoccurring in the future.

Path of Exile - Lead Programmer
Last edited by Jonathan on Mar 20, 2014, 8:36:19 AM