September 17, 2010

After posting my speculation about the JPMorgan Chase database outage, I was contacted by – well, by somebody who wants to be referred to as “a credible source close to the situation.” We chatted for a long time; I think it is very likely that this person is indeed what s/he claims to be; and I am honoring his/her requests to obfuscate many identifying details. However, I need a shorter phrase than “a credible source close to the situation,” so I’ll refer to him/her as “Deep Packet.”

According to Deep Packet,

The JPMorgan Chase database outage was caused by corruption in an Oracle database.

This Oracle database stored user profiles, which are more than just authentication data.

Applications that went down include, but may not be limited to: the main JPMorgan Chase portal; JPMorgan Chase’s ability to use the ACH (Automated Clearing House); loan applications; and private client trading portfolio access.

The Oracle database was back up by 1:12 Wednesday morning. But on Wednesday a second problem occurred, namely an overwhelming number of web requests. This turned out to be a cascade of retries in the face of – and of course exacerbating – poor response time. While there was no direct connection to the database outage, Deep Packet is sympathetic to my suggestions that: network/app server traffic was bound to be particularly high as people tried to get caught up after the Tuesday outage, or just see what was going on in their accounts; and that, given the definite operator-error contributing cause Deep Packet cited, perhaps the error would not have happened if people weren’t so exhausted from dealing with the database outage.
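
For what it’s worth, the standard defense against this kind of retry cascade (and I have no idea what JPMorgan Chase’s web tier actually does) is to cap the number of retries and back them off exponentially with some randomness, so that impatient clients don’t all hammer a slow server on the same schedule. Here is a minimal sketch in Python; the do_request callable and the retry parameters are hypothetical stand-ins, not anything from the actual system:

```python
import random
import time

def fetch_with_backoff(do_request, max_retries=4, base_delay=0.5, max_delay=8.0):
    """Call do_request(), retrying a bounded number of times on timeout."""
    for attempt in range(max_retries + 1):
        try:
            return do_request()
        except TimeoutError:
            if attempt == max_retries:
                raise  # give up rather than retry forever
            # Exponential backoff with full jitter spreads retries out in time,
            # so a slow server isn't hit by a synchronized wave of re-requests.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Without the cap and the jitter, the retry traffic itself becomes the load problem, which is more or less what Deep Packet describes happening on Wednesday.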



Deep Packet stressed the opinion that the Oracle outage was not the fault of JPMorgan Chase (the Wednesday slowdown is a different matter), but rather can be blamed on an Oracle bug. However, Deep Packet could not immediately give me details as to the root cause, or for that matter which version of Oracle JPMorgan Chase was using. Sources for that or other specific information would be much appreciated, as would general confirmation/disconfirmation of anything in this post.

Metrics and other details supplied by Deep Packet include:

The Oracle database was restored from a Saturday night backup. 874K transactions were reapplied, starting early Tuesday morning and ending late Tuesday night.

$132 million in ACH transfers were held up by the JPMorgan Chase database outage.

Somewhere around 1,000 each of auto and student loan applications were lost due to the outage.

The Oracle cluster has 8 biggish Solaris boxes (T5420 with 64 GB of RAM).

EMC is the storage provider. In early troubleshooting, EMC hardware – specifically a SAN controller – was suspected of causing the problem, but that was ruled out at some point Monday night.

JPMorgan Chase’s whole fire drill started at 7:38 Monday night, when the slowdown was noticed. Recognition that the problem was database-related was very quick (before 8 pm).

Before long, JPMorgan Chase DBAs realized that the Oracle database was corrupted in about 4 files, and that the corruption was mirrored on the hot backup (mirroring copies bad blocks just as faithfully as good ones). Hence the manual database restore starting early Tuesday morning.

And by the way, even before all this started JPMorgan Chase had an open project to look into replacing Oracle, perhaps with DB2.

One point that jumps out at me is this – not everything in that user profile database needed to be added via ACID transactions. The vast majority of updates are surely web-usage-log kinds of things that could be lost without impinging on the integrity of JPMorgan Chase’s financial dealings, not too different from what big web companies use NoSQL (or sharded MySQL) systems for. Yes, some of it is orders for the scheduling of payments and so on – but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.
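
To make the split concrete, here is a sketch of what I mean, with hypothetical table and interface names rather than anything JPMorgan Chase actually runs: route profile updates by how much durability they really need, keeping payment scheduling and the like behind ACID transactions while writing click-stream-style updates best-effort.

```python
# Hypothetical sketch only: oracle_db, log_store, and the table/event names
# are illustrative, not a description of JPMorgan Chase's actual systems.

CRITICAL_EVENTS = {"schedule_payment", "update_account_settings"}

def record_profile_event(user_id, event_type, payload, oracle_db, log_store):
    """Route a profile update by how much durability it actually needs."""
    if event_type in CRITICAL_EVENTS:
        # Financial actions need ACID guarantees: commit or roll back atomically
        # in the transactional database.
        with oracle_db.transaction() as txn:
            txn.execute(
                "INSERT INTO profile_events (user_id, event_type, payload) "
                "VALUES (:1, :2, :3)",
                (user_id, event_type, payload),
            )
    else:
        # Usage-log noise (page views, last-login timestamps, preferences)
        # can be written best-effort to a separate store.
        try:
            log_store.append({"user": user_id, "type": event_type, "data": payload})
        except Exception:
            pass  # losing a click-stream record is acceptable; losing a payment is not
```

The point is not this particular code, but the design choice: the fewer kinds of writes that have to go through the fully transactional path, the less there is to break – and to restore – when that path fails.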
