During the deployment of patch 1.0.2 we experienced a significant amount of downtime. In total the realm was completely down for 46 minutes. While we take a lot of precautions to prevent these kinds of problems as part of our build and deployment process, in this case our procedures failed. I would like to personally apologise to everyone who was affected and assure you that we will take steps to make sure that this does not happen again.



Timeline and Technical Details



At 18:33 NZT the realm was stopped to deploy patch 1.0.2. This patch contains a database migration for characters, which means that during startup the characters database is fully rewritten to a new database file in the new format. In the past this has taken around 2-3 minutes, and it happens automatically as part of the deployment process.



By 18:40 NZT we were concerned that this process was taking too long. We inspected the logs and found that the database migration process had stalled after approximately 2 million characters on all 5 database shards.



At approximately 18:43 NZT I made the call to abort the deploy, clean up, and then restart the deployment process, including the migration.



By approximately 18:48 NZT it was obvious that the migration had stalled again and so we began deeper troubleshooting.



At 18:55 NZT I made the call to roll back the realm to the previous version 1.0.1c. It was my opinion at the time that a fix to this problem would probably involve a lengthy investigation and I suspected that it would require creating a fix in the code and recompiling the backend and/or migration tools. This would have meant that a fix would have taken at least a few hours to apply. I therefore decided that it would be better to make the realm available during this time on the previous version.



I then began a deeper investigation and an attempt to reproduce the problem on a local backup of the production database while Thomas, our sysadmin, rolled the realm back.



By 19:00 NZT the rollback to 1.0.1c was live, but players were unable to log in because of a configuration change to the database that had been applied during the move forward to 1.0.2. This change was found and fixed by 19:10 NZT, at which point play could continue normally on 1.0.1c.



By 19:20 NZT I discovered the source of the original migration problem by investigating on the local database backup. The cause was a configuration change to the database on November the 2nd. This is the same configuration change that resolved the server stability issues that we experienced before that time. The change applied a speed limit to disk writes during a database checkpoint to prevent starvation of disk reads while the checkpoint was running. Since that time we had not done a database migration of any significant size.



Unfortunately the speed limit was also applied to the tool that performed the database migration described earlier. When the migrated database was closed, the dirty data in the in-memory cache was flushed to disk, and the same speed limit applied to this operation, causing it to take a very long time.
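To illustrate why a write speed limit can turn a quick flush into a long stall, here is a back-of-the-envelope sketch in Python. The cache size, disk rate and limit below are hypothetical numbers chosen for illustration, not figures from the actual configuration, and `flush_seconds` is a made-up helper name.

```python
from typing import Optional

MIB = 1024 * 1024
GIB = 1024 * MIB


def flush_seconds(dirty_bytes: int, write_limit: Optional[float], disk_rate: float) -> float:
    """Estimate how long flushing dirty cache data to disk takes.

    write_limit is a cap in bytes per second (None means no cap);
    disk_rate is the unthrottled write rate in bytes per second.
    """
    rate = disk_rate if write_limit is None else min(disk_rate, write_limit)
    return dirty_bytes / rate


# Hypothetical figures, for illustration only.
dirty = 4 * GIB
unthrottled = flush_seconds(dirty, None, 200 * MIB)   # no speed limit
throttled = flush_seconds(dirty, 2 * MIB, 200 * MIB)  # limit of 2 MiB/s

print(f"unthrottled: {unthrottled:.1f} s, throttled: {throttled / 60:.1f} min")
```

With these assumed numbers the same flush goes from about twenty seconds to over half an hour, which is the shape of the problem described above.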



After understanding what the problem was, we realised that we could deploy patch 1.0.2 after all by simply disabling the database option before the migration step, then re-enabling it before starting up the game servers. Doing it this way would not require us to prepare a new patch. I made the call to proceed with this plan.
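The disable, migrate, re-enable sequence can be sketched as a Python context manager. This is only an illustration of the pattern: the real change was made by hand in the database configuration, and the names `option_disabled`, `run_migration` and `checkpoint_write_limit_mib_per_s` are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def option_disabled(config, key):
    """Temporarily set config[key] to None, restoring the old value afterwards."""
    previous = config[key]
    config[key] = None
    try:
        yield config
    finally:
        # Restored even if the migration raises, so the game servers
        # never start without the limit in place.
        config[key] = previous


# Hypothetical stand-ins for the database configuration and migration tool.
db_config = {"checkpoint_write_limit_mib_per_s": 2}
seen_during_migration = []


def run_migration(config):
    # Record the limit the migration actually ran with.
    seen_during_migration.append(config["checkpoint_write_limit_mib_per_s"])


with option_disabled(db_config, "checkpoint_write_limit_mib_per_s") as cfg:
    run_migration(cfg)

print(seen_during_migration, db_config)
```

Putting the restore in a `finally` block is the main design point: forgetting to re-enable the option would bring back the stability issues it was introduced to fix.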



The servers were stopped again at 19:22 NZT. This deploy would take longer than our usual automatic process because we had to make several manual configuration changes along the way in order to make it work.



By 19:31 NZT the realm was working and 1.0.2 was deployed.
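As a cross-check of the 46-minute figure, the two outage windows from the timeline can be added up. This sketch assumes the realm was unavailable from 18:33 to 19:10 and again from 19:22 to 19:31, which is one reading of the timeline above; `minutes_down` is an illustrative helper.

```python
from datetime import datetime, timedelta


def minutes_down(windows):
    """Sum the length in whole minutes of (start, end) windows given as HH:MM."""
    fmt = "%H:%M"
    total = timedelta()
    for start, end in windows:
        total += datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(total.total_seconds() // 60)


# Outage windows in NZT, taken from the timeline (assumed interpretation).
outages = [("18:33", "19:10"), ("19:22", "19:31")]
total_minutes = minutes_down(outages)
print(total_minutes)  # 46
```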



Process Changes



One of the primary ways that we make sure that these kinds of database migrations will not have any problems is by applying them on a backup of the database that we have in our office. This realm is called "Staging" and our QA department uses it to test new patches before they are applied to the live realm. Our QA department logs in to characters after a migration to verify that all the data is intact and no mistakes were made.



In this case the automatic build infrastructure was started overnight, so the migration process ran unattended. Because the problem simply caused the migration to take significantly longer than expected, it still completed during the night and QA reported no problems.



Nobody checked that the migration took a reasonable amount of time.



In the future we will verify that the migration proceeded as expected by inspecting the actual logs of the migration, and we will check that the time taken for this operation is reasonable. This will ensure that this kind of problem doesn't happen again.
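A check of this kind could be automated along the following lines. This is a minimal sketch, not our actual tooling: the five-minute budget is an assumption based on the usual 2-3 minute migration time, the timestamps are placeholders, and `migration_took_too_long` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def migration_took_too_long(started, finished, budget):
    """Return True if the migration ran for longer than the allowed budget."""
    return (finished - started) > budget


# Assumed budget: a little above the usual 2-3 minutes.
budget = timedelta(minutes=5)

# Placeholder timestamps for a normal run and an overnight run.
normal_run = migration_took_too_long(
    datetime(2000, 1, 1, 18, 33), datetime(2000, 1, 1, 18, 36), budget
)
overnight_run = migration_took_too_long(
    datetime(2000, 1, 1, 22, 0), datetime(2000, 1, 2, 4, 0), budget
)

print(normal_run, overnight_run)  # False True
```

Failing the staging build when this returns True would have flagged the slow overnight migration even though it eventually completed.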



Once again, I would like to apologise for this incident. It does not represent the quality that I would hope you would expect from our company and I, personally, let you down. I hope you enjoy patch 1.0.2.



Jonathan Rogers

Technical Director

Grinding Gear Games

Path of Exile - Lead Programmer