There was a power outage affecting downtown San Francisco today, which also caused an outage at SomaFM's primary datacenter, 365 Main. Note that we've been there for about two years now, and this is the first power outage that's affected us. They had another outage right before we moved in, due to a faulty fire alarm that cut power to most of the building.

Now, a "world class datacenter" is supposed to have all sorts of redundant systems in place. And they did. But a slightly unusual series of events proved that even with all that redundancy, things can go very wrong. Here's what really went down at 365main as far as I can tell:

365 Main, like most facilities built by Above.net back in the day, doesn't have a battery-backup UPS. Instead, they have a "CPS", or continuous power system. These are very, very large flywheels that sit between electric motors and generators, so the power from PG&E never directly touches 365 Main. PG&E power drives the motors, which turn the flywheels, which then turn the generators (or alternators, I don't remember the exact details), which in turn power the facility. There are 10 of these on their roof (or as they call it, the mezzanine; it's basically a covered roof). These CPS units isolate the facility from power surges, brownouts and blackouts.

The flywheels (the CPS system) can keep the generators running at full load for up to 60 seconds, according to the specs.

There are also 10 large diesel engines up on the roof, connected to these CPS units. If the power is out for more than 15 seconds (as I recall; I could be wrong on the exact time), the diesel engines start up, clutch in, and drive the flywheels.

There is a large fuel storage tank in the basement, and fuel is pumped from it up to the roof. There are also smaller fuel tanks on the roof, with enough capacity to run all the generators until fuel starts arriving from the basement.
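Just to put those numbers together, here's a rough Python sketch of how the failover is supposed to play out for a single sustained outage. The 15-second trigger and the 60-second ride-through are from above; the diesel start-up time is just a number I made up for illustration.

```python
# Rough timeline of the designed failover for one sustained grid outage.
# The 15 s generator trigger and 60 s flywheel ride-through come from the
# description above; the 10 s diesel start-up time is an assumed figure.

GENERATOR_TRIGGER_S = 15      # grid must be down this long before the diesels start
FLYWHEEL_RIDE_THROUGH_S = 60  # flywheels can carry the full load this long
DIESEL_STARTUP_S = 10         # assumed time for the diesels to come up to speed

def failover_timeline():
    diesel_ready_at = GENERATOR_TRIGGER_S + DIESEL_STARTUP_S
    print("t=0s: grid drops, flywheels carry the load")
    print(f"t={GENERATOR_TRIGGER_S}s: outage exceeds the trigger, diesels start")
    print(f"t={diesel_ready_at}s: diesels up to speed, driving the flywheels")
    margin = FLYWHEEL_RIDE_THROUGH_S - diesel_ready_at
    print(f"margin left in the flywheels: {margin}s")

failover_timeline()
```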

Here's what I suspect happened:

It was reported there were several brief outages in a row before the power went out for good, so I bet the CPS (flywheel) systems weren't fully back up to speed when the next outage occurred. Since several of these grid power interruptions happened in a row, and each was shorter than the time required to trigger generator startup, the generators were never automatically started, BUT the CPS didn't have time to get back up to full capacity between hits. By the 6th power glitch, there wasn't enough energy stored in the flywheels to keep the system going long enough for the diesel generators to start up and come to speed before switching over.
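Here's a quick back-of-the-envelope sketch of that scenario. Again, the trigger time and ride-through come from the specs above; the recharge rate, diesel start-up time, and glitch timing are all guesses on my part, just to show how a string of short hits could drain the flywheels.

```python
# Back-of-the-envelope model of the suspected failure: a train of grid
# glitches, each too short to trigger the diesels, draining the flywheels
# faster than they can re-spin. Trigger time and ride-through are from the
# specs above; recharge rate, diesel start-up time, and glitch pattern are assumed.

GENERATOR_TRIGGER_S = 15   # grid loss must last this long before the diesels start
RIDE_THROUGH_S = 60.0      # seconds of full-load energy in a fully spun-up flywheel
DIESEL_STARTUP_S = 10      # assumed time for the diesels to come up to speed
RECHARGE_RATE = 0.1        # assumed seconds of ride-through regained per second on grid

def simulate(glitches):
    """glitches: list of (seconds_on_grid_beforehand, outage_seconds) pairs."""
    stored = RIDE_THROUGH_S  # flywheels start fully spun up
    for i, (up_s, outage_s) in enumerate(glitches, 1):
        stored = min(RIDE_THROUGH_S, stored + up_s * RECHARGE_RATE)  # partial re-spin
        if outage_s < GENERATOR_TRIGGER_S:
            drain = outage_s  # too short to start the diesels: flywheel alone carries it
        else:
            drain = GENERATOR_TRIGGER_S + DIESEL_STARTUP_S  # flywheel covers trigger + start-up
        stored -= drain
        print(f"glitch {i}: {outage_s:5.1f}s outage, {max(stored, 0):5.1f}s of ride-through left")
        if stored <= 0:
            print("flywheels exhausted before the diesels could carry the load")
            return
    print("rode through every glitch")

# Five brief interruptions ~20 s apart, then the outage that stuck (all assumed).
simulate([(20, 12)] * 5 + [(20, 120)])
```

With numbers like these, each short glitch shaves more off the flywheels than the time on grid puts back, and the sixth hit runs them dry before the diesels can take over.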

Why they didn't just manually switch on the generators at that point is beyond me. (I bet they will next time!)

So they had a brief power outage. By our logs, it looks like it was at most 2 minutes, but probably closer to 20 seconds or so.

So it looks like the diesels did cut over, but not before the CPS was exhausted in some cases. I'm told the whole facility didn't lose power, just most of it.

Here's the letter their NOC sent to customers about this:

This afternoon a power outage in San Francisco affected the 365 Main St. data center. In the process of 6 cascading outages, one of the outages was not protected and reset systems in many of the colo facilities of that building. This resulted in the following:

- Some of our routers were momentarily down, causing network issues. These were resolved within minutes. Network issues would have been noticed in our San Francisco, San Jose, and Oakland facilities.
- DNS servers lost power and did not properly come back up. This has been resolved after about an hour of downtime and may have caused issues for many GNi customers that would appear as network issues.
- Blades in the BC environment were reset as a result of the power loss. While all boxes seem to be back up, we are investigating issues as they come in.
- One of our SAN systems may have been affected. This is being checked on right now.

If you have been experiencing network or DNS issues, please test your connections again. Note that blades in the DVB environment were not affected. We apologize for this inconvenience. Once the current issues at hand are resolved, we will be investigating why the redundancy in our colocation power did not work as it should have, and we will be producing a postmortem report.

Lots of companies were affected. There was a huge line to get into the data center. It was definitely the most people I've ever seen there!

Labels: 365main, infrastructure, streaming