Last week we published a big update to Dollars for Docs, our interactive news application of payments made to U.S. health-care providers by 15 pharmaceutical companies. Compared to when we launched the project in 2010, the amount of data we’re collecting has grown enormously: The list of payments increased from around 750,000 to almost 2 million, and the grand total of the payments grew from around $750 million to just under $2 billion.

Compiling the data for it has been an enormous project right from the beginning. After we published the first version, the original developer on the project, Dan Nguyen, compiled all of the things he had to learn into a guide to scraping data. This year’s update took more than eight months of full-time work by me, working with other news-app developers, and at times with our CAR team, a researcher, two editors and two health-care reporters. It was a massive effort and presented huge technical and journalistic challenges.

After we launched, my editor pulled me aside and asked what was so hard about Dollars for Docs. What follows is my answer.

PDFs Considered Harmful

To build the database that powers Dollars for Docs, we had to convert 65 gigantic disclosure reports from some other format into raw structured data. Each company releases its data in a slightly different way, but broadly speaking there are two ways they present the data: A few build websites to publish the data, and many more disclose their payments in humongous Adobe Acrobat PDF files.

The trouble is, PDF was not designed as a data format. It was designed as an "electronic paper" format. That is, "something whose contents (text and 2D images) would look the same on any computer at any time," Adobe Senior Product Marketing Manager Ali Hanyaloglu told me.

PDFs are a format engineered to present elements in perfect fidelity to their creator's intentions. In their most basic form, PDFs don’t know what tabular data is. They don’t even know what words are.

Here’s how a PDF works, deep down: It positions text by placing each character at minutely precise coordinates in relation to the bottom-left corner of the page. It does something similar for other elements like images. A PDF knows about shapes, characters and their precise positions on the page. Even if a PDF looks like a spreadsheet -- in fact, even when it’s made using Microsoft Excel -- the PDF format doesn’t retain any sense of the “cells” that once contained the data.

Adobe, which invented the PDF in 1993, acknowledges its shortcomings when it comes to data. “For the person who wants raw data, PDF isn’t the right choice," wrote Adobe Senior Principal Scientist Jim King in a blogpost. In 2000, Adobe made an updated version of the PDF format with an easy way to attach files so that original data could be published alongside an attractive presentation.

Incidentally, Adobe made doing this really easy. We made a screencast showing how. According to Hanyaloglu, it can be done using Adobe Acrobat, Microsoft Office or even with the free Adobe Reader. I’m not quite sure why I don’t see PDFs with attached data in the wild more often. In any event, none of the 50 PDFs we used to compile Dollars for Docs database attached the raw data.

For each PDF, we had to reconstruct the raw data, and it’s not an easy thing to do. We don’t know of any off-the-shelf software that did precisely what we needed to do, so we wrote our own. We named it "Farrago." The first version was built by David Frackman, a New York-based developer who recognized this as a problem that could be addressed with a computer science discipline called computer vision, which is used for things like facial recognition and automatic license-plate readers.

Farrago reconstructs the data from a PDF by considering each character one by one to determine which column it belongs in. It uses a computer vision technique called a Hough Transform to detect the vertical black lines in the background of the PDF that visually separate columns in the table.

But not all PDFs have borders around each table cell. For these PDFs, we had to visually guess the locations of these dividing lines and adjust the code again and again until it came out right. Unfortunately, there's no easy way to check if tens of thousands of lines were all copied correctly. Many times we had to reprocess a file because one person's long name was inadvertently split, with most of their name in the name column and the last few letters in the next column. For example, an incorrect column dividing line location for a payment to a doctor with the last name "Ingemar" in Salt Lake City might end up listing "INGE" as a last name and the city as "MARSALT LAKE CITY.”

Even when Farrago has an accurate model for which columns everything belongs in, things aren't easy. As I mentioned above, each letter is positioned on the page independently from other letters in the documents. That's more inconvenient than it may sound: While letters like "a" or "c" have about the same location on the y-axis if they are on the same line, letters like “p” or "g" with descenders are a tiny bit closer to the bottom of the page than letters without descenders, like h and m. Punctuation marks like commas and periods also behave unexpectedly, for the same reason. Farrago "rolled up" the letters within an arbitrary number of pixels of the baseline into a single line of text.

Here’s another mystery we found. Some PDFs we processed seem to contain duplicate lines of text that weren’t there if you just looked at the PDF. We had no idea why. The PDF specification, which is the official recipe for how to write a program to display, print and create PDFs, made no mention of this phenomenon.

A paper by Tamir Hassan, a computer scientist whose research focuses on table recognition and information extraction, may have solved the mystery. “Characters (or complete strings),” he wrote, “are sometimes overprinted with a slight offset to simulate boldface type. As long as these instructions follow another, they are automatically detected and represented by a single text fragment with the boldface flag set to true.”

Finally, because PDFs don’t know about words, they often don’t include a space character separating them. Instead, PDFs simply locate each group of letters a few pixels to the right of the one that precedes it. A reader can’t tell the difference, but to a computer the difference is a big pain. We used some PDF-reading software libraries that made sometimes imperfect guesses about where to put spaces in.

Even after we had Farrago doing the heavy lifting, it often took hours to precisely tune how it processed a given PDF, and it sometimes took days to discover subtle problems and write a fix or workaround.

Breaking the Web

If they follow best practices, websites are easy to turn into structured data. But pharmaceutical companies' websites don't always follow best practices. Some clumsily reinvent the basic mechanics of the Internet and make “scraping” the data out of them exceedingly difficult.

One company's disclosure site spanned 4,000 web pages that all shared a single URL. That is, there was no way to get to page 3,000 without viewing page one, then page two, then all the intervening pages through page 2,999. This wouldn't be so bad, except the company's web servers often got overwhelmed and returned a broken page, forcing us to start over.

After I re-engineered the code to work around the site's limitations with my colleague Jeff Larson's help, the server would still occasionally get confused, and in unpredictable ways. If we requested, for instance, pages three and four, the server might return pages labeled "3" and "4" but both containing the data intended to be on page four.

Another company's disclosures seemed, at first, to all be on a single 29.6 megabyte web page. It took more than two and a half minutes to download and render in Google Chrome -- on a very fast Internet connection. But it turns out this behemoth of a web page did not even include all of the disclosure data. It included only the total amount paid to each payee. The details about each payment were on a fragment of HTML that appeared when a user clicked a link. We had to download all 81,335 of these separately and merge them with the list of payees so that we could report which portion of each payment was for meals, consulting, speaking, travel, etc.

It Gets Worse

Even when all the data was downloaded, we faced a row of hurdles.

There is no standard way for companies to define each kind of payment -- at least, not yet. The 15 companies describe categories of payments in different ways and some even define categories differently between their own reports. For instance, while many companies disclose the value of meals given to doctors, Merck combines meals, travel expenses and royalties, so there is no way to break out the value of meals paid for by Merck to compare it to similar expenses disclosed by other companies. In all, we tracked 155 distinct category definitions across all of the companies in Dollars for Docs. The specific definition created by each company for its payment categories is available as a “tool-tip” next to each payment on Dollars for Docs payment pages.

Another hurdle was dealing with the small number of companies that disclosed payments as ranges rather than precise dollar amounts. For instance, Allergan reported paying between $1 and $1,000 to a doctor named Aaron Jacob Friedman in Lafayette, La., for "business meals" in 2011. There is, of course, no way of knowing if that was lunch at McDonald’s for $3 or dinner at Le Cirque for $999. Multiplied by the 95,582 payments in Allergan’s disclosures, this was a big problem. In the end we reported the ranges as we found them and excluded all the ranged payments from our aggregate numbers, like state and company totals.

Because any doctor's name is likely to be reported differently by each drug company, we had to be careful about how we reported what was paid to each doctor. Some companies list middle initials and others don’t. Some checks appear to have been written out personally to a doctor, others to the doctor's practice. Some doctors' addresses are listed in multiple cities. Manually researching millions of records to determine which slightly different names are really the same person and which doctors share a name but are different people would be an impossible task. We chose not even to attempt it and instead followed the example set by the Center for Responsive Politics' OpenSecrets site. We show names as we find them and depend on users to know which records in a search result page refer to their doctor.

For doctors who stood out -- like the 22 doctors who earned more than half a million dollars -- our reporters confirmed that the payments were all to the same person.

All of this made comparing the data and drawing conclusions very complicated. For instance, because we couldn’t with high confidence group all of a doctor’s payments together, we couldn’t report the highest paid doctor in each state. Luckily, some of the news organizations who’ve done stories based on Dollars for Docs data did the necessary legwork and do name the most highly paid docs. We also could not compare some companies payments to those made in a previous year. For instance, Merck's 2012 payments can’t be compared to its 2011 payments because they changed what types of payments they disclosed. Nor could we compare companies payments against each other, as each company discloses different types of payments.

Making Sure We Get It Right

After the data was collected, we worked very hard to make sure that every data point in our database was accurately reconstructed from its source. The techniques we used are detailed in my colleague Jennifer LaFleur's data bulletproofing guide.

During more than a dozen rounds of spot checking, my fellow ProPublicans used statistical sampling to make random selections of hundreds of records at a time and manually compared them to the record in the original PDF or website. We also manually recomputed every sum and average -- for aggregates like companies and states -- using Microsoft Access and compared them to the version calculated by our Ruby code.

We also used a SQL “GROUP BY” query to get all of the unique state abbreviations in the data set. This helped us locate errors like "M I" (with a space) instead of "MI" (without the space) for Michigan. We then sorted every column to make sure that it didn't include numbers in alphabetical columns, like city names, or letters in the amount column.

Another important component of our data-bulletproofing system was keeping our code and data well-organized. Each step of the data-cleaning process had its own script. For instance, one script used Farrago to scrape a PDF and filtered out header rows. For each step, we saved each of these scripts' output. This made it easy to track down errors. If, during one of our spot checks, we found a line that was missing in our final data that was present in the source data, we could look at each step of the data-cleaning process for that document to pinpoint which script had introduced the problem. After we had fixed the problem, the whole process could be re-run with just one command.

Help is On the Way

The Physician Payment Sunshine Act may resolve many of the problems we had. A provision of the Affordable Care Act, the PPSA will require pharmaceutical companies and medical device manufacturers to disclose their payments to doctors directly to the government, which will in turn put them on a public website that makes the data "downloadable and easily aggregated" -- that is, as structured data. (Speaking of disclosures, here’s one from us: ProPublica has been consulting with Deloitte, who is bidding on a contract to build that website.)

Drug companies and manufacturers will also be required to disclose, on an established timeline with specific definitions, precise payment amounts and to disclose a doctor's middle initial, if they know it. Companies' disclosures will also have to include each doctor's National Provider Identifier, which is a unique numeric identifier for medical professionals, though that number won’t be made public. So when PPSA is fully implemented in mid-2014, the payments will be comparable between companies and across time periods.

Until then, we’ll still be here doing things the hard way. Despite all of the problems, we believe strongly that Dollars for Docs is an important tool to ensure that Americans can have full and frank discussions with their doctors about their financial relationships.

One More Thing

Manuel Aristarán, a Knight-Mozilla OpenNews Fellow at La Nación, a newspaper in Buenos Aires, Argentina, has been developing a tool called Tabula for extracting tabular data from PDFs. Tabula's interface and algorithm for assembling characters into lines of text are much more developed than Farrago’s, so we’re helping fold Farrago into Tabula. Look for an open-source release of Tabula very soon.

Update: Tabula has been released.