By Ben Sedat - May 23, 2012

Update 5/24/2012: United has confirmed that they have found and fixed this specific issue.

Something we talk about a lot at Tinfoil is the existence of two mindsets when engineering software: building and breaking. Thinking about security requires a different mindset than building working software. You have to keep all of the terrible things that can happen in your head and validate your software against them. Whether this validation is personal, peer reviewed, or statically or dynamically analyzed (by someone like us), it is crucial to making sure your software performs the way you think it does when fate (or malicious users) conspires against it.

In this case, it was a late night and I was trying to buy a last-minute flight, shopping around to get a reasonable price. I had several tabs open to various airlines, and was searching on and off for a few hours. I finally decided to purchase a ticket from United Airlines.

I picked a seat and was presented with a page to enter my info for the TSA (pretty standard these days). United had recently updated their site’s interface and had a dropdown to select saved passengers. I clicked the dropdown and was surprised to see a large number of names, none of which were mine. I looked down the list, noticing patterns in people with the same last name, and realized what I was likely looking at: the passenger manifest for the flight.

Kind of scary, and nothing I had any business looking at. This was something that I ran into completely organically, no shenanigans or security testing on my part (we need approval from a site’s owner to run most security testing, and I’m not going to go out and violate wire fraud laws).

As serendipity would have it, a help widget popped up on the page, likely because I wasn’t moving forward in the purchasing process. I dutifully called United to report the problem. When they emulated my account, they weren’t able to reproduce the issue. I still could, and several parts of the site, like the account management page, were completely broken and displaying “None” for all of the values. I finally gave up and logged out, and everything was back to normal. Hmmm.

So what was going on?

I don’t have direct access to United’s code, but I think that my session (likely invalid) was part of the problem, since logging out seemed to fix it. Sessions, especially long-lived ones, can be tricky to manage. If my session was broken, I should have been issued a new one or, in the worst case (from a UX perspective), lost my progress and had to log in again. Instead, the site defaulted to showing me things that didn’t belong to me. Some list was probably supposed to be filtered by the user in the session, and since my user was now unset or invalid, nothing got filtered out. This is just educated guessing, but it illustrates a situation where an invalid session leads to a lot of private information getting leaked.
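
To make that guess concrete, here is a minimal sketch in Python of how a filter can fail open. Everything in it is hypothetical: the data, the function names, and the InvalidSession exception are made up for illustration and are not United's code. The first function treats a missing user as "no filter", while the second refuses to answer without an identity.

```python
from typing import List, Optional

# Hypothetical data: (owner_id, passenger_name) rows. Not a real schema.
SAVED_PASSENGERS = [
    ("alice", "Alice Example"),
    ("alice", "Adam Example"),
    ("bob", "Bob Sample"),
]


class InvalidSession(Exception):
    """Raised when a request carries no usable user identity."""


def saved_passengers_fail_open(user_id: Optional[str]) -> List[str]:
    # Bug: when user_id is None the ownership check is skipped,
    # so every row is returned to whoever asked.
    return [
        name
        for owner, name in SAVED_PASSENGERS
        if user_id is None or owner == user_id
    ]


def saved_passengers_fail_closed(user_id: Optional[str]) -> List[str]:
    # Refuse to build the list at all without an identity.
    if not user_id:
        raise InvalidSession("no authenticated user in session")
    return [name for owner, name in SAVED_PASSENGERS if owner == user_id]
```

With a broken session (user_id of None), the fail-open version returns all three names, while the fail-closed version raises and lets the caller send the user back to the login page.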

Session management and authentication are listed as an OWASP Top 10 issue and probably deserve their own blog post, but the overarching principle that could have saved the day here is Failing Securely.

Failure scenarios and edge cases come up frequently in software, but can sometimes fall by the wayside. Fate conspires against us: power gets cut, kernels panic, null pointers abound. Software needs to respond accordingly if it can. If it can’t, it needs to fail out as gracefully as possible. Constructs like database transactions can assist, but they need to be considered carefully when validating the system.

Netflix takes this to the extreme with their Chaos Monkey, which randomly terminates instances running in production. Your code could stop executing at any time, and for no good reason. Random faults like this pose a greater challenge to software builders, but accounting for them can be considered in two stages.

The first stage (the easier one) is to just fail out. The job or request fails, and hopefully something will drive it again. The trick here is to consider where and how things can fail, and how the system will respond. The analysis of what customers will see, what background jobs get stuck, and what state could be left in limbo is what makes this interesting. Even if your database is rolling back a transaction, other actions like network requests can’t be rolled back. These gaps can’t always be picked up by static or dynamic analysis, but good tests can help set expectations.
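
The rollback gap is easy to demonstrate. The sketch below uses Python's standard sqlite3 module with an in-memory database. The bookings table, the book_seat function, and the outbox list (a stand-in for a network call such as sending a confirmation email) are all hypothetical names chosen for illustration.

```python
import sqlite3


def book_seat(conn, outbox, passenger, email, crash=False):
    """Insert a booking and 'send' a confirmation inside one transaction."""
    try:
        # As a context manager, the connection commits on success
        # and rolls back if the block raises.
        with conn:
            conn.execute(
                "INSERT INTO bookings (passenger) VALUES (?)", (passenger,)
            )
            # Stand-in for a network request. It is not part of the
            # database transaction, so a rollback cannot undo it.
            outbox.append(email)
            if crash:
                raise RuntimeError("simulated failure before commit")
    except RuntimeError:
        return False
    return True


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (passenger TEXT)")
    outbox = []
    ok = book_seat(conn, outbox, "Alice Example", "alice@example.com", crash=True)
    rows = conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]
    conn.close()
    return ok, rows, outbox
```

After demo() runs, the bookings table is empty because the insert was rolled back, but the outbox still holds the confirmation: the customer has been told about a booking that does not exist. One common way to narrow the gap is to perform the external action only after the commit succeeds.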

The next stage, recovering from a fault without interrupting the user, is more complex. Ideally you want to rescue the situation and keep marching forward: self-healing. It’s not always possible, and there’s a balance of complexity to consider, but a system that can maintain itself is a powerful thing.
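
One of the simplest forms of self-healing is retrying an operation that failed for a transient reason. The sketch below is one possible shape for that, not a complete recovery strategy. The call_with_retry helper and its parameters are hypothetical, and it assumes the operation is safe to repeat and signals transient trouble by raising ConnectionError.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            # Wait longer after each failure, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: fall back to stage one and fail out.
    raise last_error
```

If the fault clears within the allowed attempts, the user never sees it. If it doesn't, the error still surfaces, so the caller can fail out cleanly instead of carrying on with bad state. The sleep parameter exists only so tests can swap in a fake and avoid real delays.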

United has had some time to work through these issues, so we’re pretty confident this has been fixed, and we haven’t been able to reproduce it since.

Whether you go for something heavy-handed like the Chaos Monkey or wait for Murphy’s Law to intervene, keep failure in mind. Thinking about failure early, as United should have, will help ensure that you won’t fail so publicly later. Tinfoil Security can also help with that, finding vulnerabilities like this before your customers do.

(Thanks to Solomon Bisker, Spike Gronim, Koyel Bhattacharyya, Ray He, Nick Semenkovich, Michael Borohovski, and Ainsley Braun for reading drafts of this post.)