BART foul-up traced to botched computer upgrade

Traffic backs up Friday morning on southbound Pleasant Hill Road in Lafayette. Disruptions with BART service led to long delays on Bay Area roads. Photo: Courtesy @rosie1208 / Instagram

BART engineers are trying to determine why a regular, nothing-special system update managed to crash the transit district's computers on Friday, slowing late-night train traffic to a crawl and shutting down morning commuter service for hours.

"This was a normal update to the system that we do all the time," said BART spokeswoman Alicia Trost. "We didn't even realize there was a problem until around midnight, when the central computer shut down."

The loss of computer systems forced the trains to run largely under manual control, causing delays of more than 90 minutes for late-night riders. BART stations didn't reopen until after 7 a.m. for the morning commute, forcing riders into cars and buses and jamming Bay Area roadways.

The unusual circumstances behind the problem resulted in what transit district officials are calling the longest computer failure in memory.

BART has had such computer outages before, said Paul Oversier, the transit system's assistant general manager for operations, but they're typically brief.

"Usually, it's minutes, not hours," he said. "One this bad, I can't remember."

There was nothing brief about Friday's breakdown, which started hours before the troubles became evident.

Engineers installed the configuration update to a computer network server on Thursday morning with no apparent hiccups. But a glitch in the update began quietly spreading through the system, resulting in a cascade of problems.

"It didn't happen all at once," Trost said. "There was a slow degradation of the system."

Information exchange

The problem eventually began to affect the exchange of information between BART's many computer servers, including the part of the system the Operational Control Center relies on to monitor train service.

Without computers controlling the system's 400 track switches on the main line, operators had to get out and physically change those switches to ensure that their trains remained on the proper tracks, with each change taking from five to 10 minutes, Trost said. At some of the more complicated switch points, special crews were sent out to work them.

"It was all hands on deck," she said. "If you were trained and authorized to crank a switch, you were out there."

It was 3 a.m. before the final train finished the stop-and-go process and clanked its way to the end of the line, more than an hour and a half later than scheduled. Dozens of engineers and technicians worked in the computer control center in downtown Oakland to get the system up and running, but that didn't happen in time for the regular station openings at 4 a.m. After setting and missing a 5 a.m. goal to restore service, BART officials, in effect, surrendered.

In e-mail and Twitter updates sent out shortly before 7 a.m., transit district officials warned, "There is no BART service this morning, and the Bay Area is urged to seek alternative means of transport during the morning commute."

Not surprisingly, there were plenty of commuters who didn't get the word.

Cascading problems

Denise Flagg rolled up to the Orinda BART station at 4:29 a.m., as she does each weekday en route to her finance job in San Francisco, and was stunned to find it closed. But she was also hungry. So it was an opportunity.

"They told me the trains would be back up at 5 a.m., so I turned around, went home and had breakfast," the 45-year-old Lafayette resident said. She quickly ate and made it to the Lafayette station by 4:59 a.m. - but again, nothing was rolling.

This time all she could do was wait in the dark and the cold.

"When it's working, BART is good," Flagg said. "But when something like this happens, or the strike - torture. Real torture. Happy Friday, eh?"

It was the slow-motion aspect of the computer problems that made it especially tough to run down the cause, Trost said. Because the server upgrade had been installed 12 hours before the trouble was detected, it wasn't the prime suspect in BART's desperate search for a culprit.

But once the trouble was tracked back to the upgrade, "it was a very quick fix," she said. "The server was returned to its original configuration and the problem was fixed."

Limited train service resumed at 7:18 a.m., with full service restored within 90 minutes.

Despite the ongoing contract dispute between BART and its employee unions, there's no indication Friday's problems were anything but a technical malfunction, Trost said. While riders were inconvenienced by the early morning delays and the late opening of the train system, "at no time was train safety compromised."