A firehose of real-time FAA radar data might not sound like the best way to have divined John McCain's vice presidential pick during last year's US presidential campaign; better to just cast lots, read the tea leaves, and make a wildly overconfident guess like the rest of the DC punditocracy. But thanks to an innovative website called FlightAware and a bit of digital sleuthing, plane data did in fact identify the VP pick as Alaska Governor Sarah Palin.

When McCain went to Ohio to announce his choice of a running mate, Marc Ambinder of The Atlantic noted the "journey of a sturdy Gulfstream 5 with the tail number of N22GY. Anchorage, Alaska to Hook Municipal Field in Ohio. Nearby Dayton. 30 miles away. And the plane was in Flagstaff, AZ recently." The plane was also registered to a big Republican donor.

Ambinder used the popular flight-tracking website FlightAware to piece together this information and to display the plane's flight path on a map of the country. He's not the only one using FlightAware for reasons that have nothing to do with aviation, either. When the Alabama Crimson Tide were looking for a new football coach back in 2007, FlightAware was the tool of choice for tracking the school's private plane and the private planes of other schools around the country.

The New York Times took a look at the rabid fan speculation fueled by such flight scrutiny. "Was South Carolina Coach Steve Spurrier flying into Tuscaloosa Regional Airport?" fans wondered. "Was a plane owned by the University of Alabama departing for Norman, Okla., perhaps with university officials on their way to court Sooners Coach Bob Stoops?"

Then, of course, there's the fascination of pulling up the flight plans and maps of doomed aircraft, such as US Airways Flight 1549, which landed in the Hudson. Even in Australia, newspapers were citing FlightAware and using its imagery of the plane's brief journey and descent into the water.

A six-minute trip into the Hudson

FlightAware is an incredible mashup of radar data, air traffic controller records, flight plans, and maps. Like any great mashup, it provides the public with a new window into an opaque room, and usage has exploded, often in surprising ways. But this is no ordinary mashup; FlightAware sucks in more than 1GB of new data every day, tracking almost every commercial flight in the US in near real time, and maintaining a historical database that now tops 60 million flights. Getting the system up and running posed major scalability challenges, but the team behind the site solved them with open-source tools, a custom XML feed interpreter, and a set of powerful servers.

I spoke with Karl Lehenbauer and David McNett, two of the three partners behind the site, about turning a mashup into a full-time job, handling so much data on a daily basis, and the business case for giving back to open source projects.

Backing into business

Karl Lehenbauer

FlightAware wasn't intended to be a job; it was meant to be a resource for the three pilot partners to use for tracking their own flights. It has "been a labor of love for us," says Lehenbauer, one that first launched on someone's personal computer, a machine with 1GB of RAM and a single-core processor. That was fine when a few friends used the tool, but when it was posted to an internal Microsoft mailing list for pilots, FlightAware jumped from 11 users to 1,000 users overnight. It was then that the team realized, "We need a real server."

As the public gradually became aware of just how useful it could be to have near-realtime access to flights, every new air incident brought in more users. When Nike's corporate jet—carrying all the top executives—had a landing gear problem and circled the Boston airport for three hours before touching down safely, FlightAware's user numbers jumped again, from "1,100 to 11,000." And so on.

But mapping tens of thousands of flights every day is no simple task. FlightAware's system first has to integrate numerous data sources, including a "firehose" of non-archived FAA data; if FlightAware's systems miss anything from the FAA stream due to downtime, network problems, or processing issues, that data is gone and can't be retrieved.

David McNett

Most of the data sources arrive in XML format and require substantial parsing before they can be used. There are obvious issues, such as the fact that three different datasets record latitude and longitude in three different ways. But the problem runs much deeper; much of the data received is "very, very poor." Typos, conflicting information, out-of-bounds data: it's all common. Much of the information is typed in by air traffic controllers working quickly; other position data might come from two different radar installations as a plane passes between zones in the US air grid, and the two sets of positions might not match.
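To make the coordinate problem concrete, here is a minimal Python sketch of the kind of normalization step a feed interpreter needs. It is not FlightAware's code, which the company keeps private; the three input formats, the regular expressions, and the function name normalize_position are all assumptions chosen for illustration.

```python
import re
from typing import Optional, Tuple

# Three made-up input styles (assumptions, not the real feed formats).
DECIMAL = re.compile(r"^(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)$")       # 29.5,-95.25
DEG_MIN = re.compile(r"^(\d{2})(\d{2})([NS])(\d{3})(\d{2})([EW])$")  # 2930N09515W
DMS = re.compile(                                                    # 29-30-00N/095-15-00W
    r"^(\d{2})-(\d{2})-(\d{2})([NS])/(\d{3})-(\d{2})-(\d{2})([EW])$"
)


def _signed(value: float, hemisphere: str) -> float:
    """South and west become negative, matching the decimal convention."""
    return -value if hemisphere in ("S", "W") else value


def normalize_position(text: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) in signed decimal degrees, or None if unusable."""
    text = text.strip().upper()
    if DECIMAL.match(text):
        lat_s, lon_s = DECIMAL.match(text).groups()
        lat, lon = float(lat_s), float(lon_s)
    elif DEG_MIN.match(text):
        d1, m1, h1, d2, m2, h2 = DEG_MIN.match(text).groups()
        lat = _signed(int(d1) + int(m1) / 60, h1)
        lon = _signed(int(d2) + int(m2) / 60, h2)
    elif DMS.match(text):
        d1, m1, s1, h1, d2, m2, s2, h2 = DMS.match(text).groups()
        lat = _signed(int(d1) + int(m1) / 60 + int(s1) / 3600, h1)
        lon = _signed(int(d2) + int(m2) / 60 + int(s2) / 3600, h2)
    else:
        return None  # unrecognized format, e.g. a typo
    # Reject out-of-bounds values instead of trying to guess a repair.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon
```

Dropping a bad position outright is only one possible policy; a production system would likely also keep the raw value around for later inspection.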

"It's been fun" to figure this out, says Lehenbauer. "Fundamentally, it's a reputation challenge." Whose data do you trust, and when? The feed interpreter that makes sense of this mound of data is one of the company's most closely-held assets.

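One way to picture the "reputation" idea is a scoring rule that weighs how much a source is trusted against how stale its report is. The Python sketch below is purely illustrative: the source names, the trust weights, the two-minute cutoff, and the linear decay are invented for this example, not taken from FlightAware's interpreter.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Report:
    source: str       # hypothetical source label
    timestamp: float  # seconds since some epoch
    lat: float
    lon: float


# Invented trust weights; a real system would tune these from experience.
TRUST = {"radar": 0.9, "flight_plan": 0.6, "controller_entry": 0.4}


def pick_report(reports: Iterable[Report], now: float,
                max_age: float = 120.0) -> Optional[Report]:
    """Pick the report with the best trust-times-freshness score."""
    best, best_score = None, 0.0
    for report in reports:
        age = now - report.timestamp
        if age < 0 or age > max_age:
            continue  # from the future or too old to matter
        # Freshness decays linearly from 1.0 (just now) to 0.0 (max_age).
        score = TRUST.get(report.source, 0.1) * (1 - age / max_age)
        if score > best_score:
            best, best_score = report, score
    return best
```

With these numbers, a fresh controller entry beats a radar fix that is nearly two minutes old, but loses to a radar fix of the same age.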
After undergoing its scrubdown, the clean data has to be stored. Jamming that much data directly into databases quickly led to database corruption problems, so the FlightAware coders wrote their own memory-resident database in C. The program holds the last 24 hours of activity in a couple of gigabytes of RAM, only pushing it out into a PostgreSQL database for archiving after a day has passed (using Oracle was simply too expensive, and the team wasn't thrilled with Microsoft's database solutions).
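The pattern described here, a hot in-memory window that drains into an archive, can be sketched in a few lines. FlightAware's version was written in C; this Python sketch only illustrates the idea. The class name RecentStore, the archive callback (standing in for a batch insert into PostgreSQL), and the assumption that records arrive in timestamp order are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # keep one day of activity in memory


class RecentStore:
    """Holds recent records in RAM and hands older ones to an archiver."""

    def __init__(self, archive: Callable[[List[dict]], None],
                 window: float = WINDOW_SECONDS) -> None:
        self._archive = archive  # e.g. a function doing a batched DB insert
        self._window = window
        self._rows: Deque[Tuple[float, dict]] = deque()

    def add(self, timestamp: float, record: dict) -> None:
        # Assumes records arrive in non-decreasing timestamp order.
        self._rows.append((timestamp, record))

    def flush_expired(self, now: float) -> int:
        """Move everything at least `window` seconds old to the archive."""
        expired: List[dict] = []
        while self._rows and now - self._rows[0][0] >= self._window:
            expired.append(self._rows.popleft()[1])
        if expired:
            self._archive(expired)  # one batch call instead of many small writes
        return len(expired)

    def __len__(self) -> int:
        return len(self._rows)
```

Batching the expired rows keeps the archive database out of the hot path: live lookups touch only memory, and the disk sees occasional bulk writes rather than a constant stream of small ones.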