Dropbox went down Friday night after some normally routine upgrades went awry. While the company restored most functionality within three hours, problems for some users persisted until Sunday.

The outage was followed by spurious claims from hacking groups that they had successfully infiltrated Dropbox. There was no evidence to support such claims, and Dropbox quickly explained on Friday that the outage was due to an internal problem. Dropbox head of infrastructure Akhil Gupta then followed up last night with more details on what caused the downtime:

"We use thousands of databases to run Dropbox. Each database has one master and two slave machines for redundancy. In addition, we perform full and incremental data backups and store them in a separate environment. On Friday at 5:30 p.m. PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS. A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down."

User files were never at risk during the outage, the company said. The databases in question are used to provide services like photo album sharing, camera uploads, and API features.

Dropbox restored most service Friday night by recovering from backups, but problems persisted. On Sunday at 1:59pm PT, the company reported that "[a]bout 5% of our users are still experiencing problems syncing from the desktop client, and about 20% of users are having issues accessing Dropbox through our mobile apps." By 4:40pm PT on Sunday, "core service" was fully restored, although the company was still "working through a few last issues with the Dropbox photos tab."

Dropbox says it learned a few things from the outage that might prevent future ones or speed up recovery time. The company has "added an additional layer of checks that require machines to locally verify their state before executing incoming commands. This enables machines that self-identify as running critical processes to refuse potentially destructive operations."
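
To make the idea concrete, here is a minimal Python sketch of a machine-side guard of the kind Dropbox describes. It is not Dropbox's code: the `Machine` class, the command names, and the use of a running `mysqld` process as the marker for "critical" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical command names; the real command set is not public.
DESTRUCTIVE_COMMANDS = frozenset({"reinstall_os", "wipe_disk"})


@dataclass
class Machine:
    """Illustrative model of a host that checks its own state locally."""
    name: str
    running_processes: set = field(default_factory=set)
    # Assumption: a running database server marks the machine as critical.
    critical_processes: frozenset = frozenset({"mysqld"})

    def is_critical(self) -> bool:
        # The machine inspects its own processes rather than trusting
        # whatever the remote upgrade script believes about it.
        return bool(self.running_processes & self.critical_processes)

    def handle(self, command: str) -> str:
        # Refuse destructive operations while critical work is running.
        if command in DESTRUCTIVE_COMMANDS and self.is_critical():
            return "refused"
        return "executed"


active_db = Machine("db-1", {"mysqld", "sshd"})
spare = Machine("spare-1", {"sshd"})
print(active_db.handle("reinstall_os"))
print(spare.handle("reinstall_os"))
```

The point of the design is that the final check lives on the target machine, so a bug in the central upgrade script cannot by itself reinstall a host that is still serving data.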

The company said it also developed a tool to speed up recovery from large MySQL backups because "the large size of some of our databases slowed recovery."

"We plan to open source this tool so others can benefit from what we’ve learned," Dropbox said.