For citizens of the US, the social security number (SSN) is the gateway to all things financial. It serves its government purpose of helping us pay our taxes and track our (in many cases, hypothetical) government benefits, and it has also been widely adopted as a means of verifying identity by a huge range of financial institutions. As a result, any time you disclose an SSN, you run a real risk of enabling identity theft. So far, most of the SSN-related ID theft problems have resulted from institutions that were careless with their record keeping, allowing SSNs to be harvested in bulk. But a pair of Carnegie Mellon researchers has now demonstrated a technique that uses publicly available information to reconstruct SSNs with a startling degree of accuracy.

The irony of their method is that it relies on two practices adopted by the federal government that were intended to reduce the ability of fraudsters to craft a bogus SSN. The first is that the government now maintains a publicly available database called the Death Master File, which indicates which SSNs belonged to individuals who are now deceased. This record provided the researchers with the raw material to perform a statistical analysis of how SSN assignments relate to two other pieces of personal information: date and state of birth.

The second is that the government has centralized its handling of SSN assignments and provided documentation of the procedures. The first three digits are based on the state where the SSN was originally assigned, and the next two are what's termed a group number. The last four digits are ostensibly assigned at random. Since the late 1980s, the government has promoted an initiative termed "Enumeration at Birth" that seeks to ensure that SSNs are assigned shortly after birth, which should limit the circumstances under which individuals apply for them later in life (and hence, make fraudulent applications easier to detect).
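The three-part layout described above can be illustrated with a minimal sketch. The function and type names here (`split_ssn`, `SSNParts`) are hypothetical, chosen only for illustration, and the sample number is an obvious placeholder rather than a real SSN.

```python
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # first three digits, based on the state of assignment
    group: str   # next two digits, the "group number"
    serial: str  # last four digits


def split_ssn(ssn: str) -> SSNParts:
    """Split a nine-digit SSN (with or without dashes) into its three fields."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly nine digits")
    return SSNParts(digits[:3], digits[3:5], digits[5:])


# Placeholder value, not a real person's number.
parts = split_ssn("123-45-6789")
print(parts.area, parts.group, parts.serial)
```

The first two fields together make up the five digits that later prove easiest to predict; the serial field is the four-digit remainder.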

That last program proved to be the key to the new research, as it ensured that SSN assignments were more tightly correlated with date of birth. The researchers used the Death Master File to split out the records by individual state (which determines the first three digits) and then ordered them by date of birth. At that point, they searched for statistical patterns within the resulting data.
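The split-and-order step can be pictured with a short sketch. This is not the researchers' code: the record layout, the state labels, and every value below are invented for illustration, and the real Death Master File format is not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (state label, date of birth, first five SSN digits).
# All values are made up; they do not correspond to real people or states.
records = [
    ("A", date(1991, 3, 2), "10012"),
    ("B", date(1996, 7, 9), "20045"),
    ("A", date(1991, 1, 15), "10010"),
    ("B", date(1996, 2, 1), "20043"),
]


def order_by_state(rows):
    """Group rows by state, then sort each group by date of birth."""
    by_state = defaultdict(list)
    for state, born, prefix in rows:
        by_state[state].append((born, prefix))
    # Tuples sort by their first element, so each group ends up in date order.
    return {state: sorted(group) for state, group in by_state.items()}


ordered = order_by_state(records)
for state, group in ordered.items():
    print(state, [prefix for _, prefix in group])
```

Once the records are laid out this way, any tendency for the leading digits to advance with birth date becomes visible within each state's list.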

Even in data from before the 1990s, rough patterns were apparent in the assignment of region (the first three digits) and group numbers. By the mid-90s, it's obvious that, with a few exceptions, individual region and group numbers were used in a clear sequential order for most SSNs. The patterns are even easier to pick out in less populous states. Patterns in the final four digits were harder to detect, but the authors created an algorithm that predicted them, albeit with a lower degree of confidence.

The accuracy of these algorithms is positively disturbing. Using a separate pool of data from the Death Master File, the authors were able to get the first five digits right for seven percent of those with an SSN assigned before 1988; after that, the success rate goes up to a staggering 44 percent. For a smaller state, like Vermont, they could get it right over 90 percent of the time.

Getting the last four digits right was substantially harder. The authors used a standard of getting the whole SSN right within 10 tries, and could only manage that about 0.1 percent of the time, even in the later period. Still, small states were somewhat easier: for Delaware in 1996, they had a five percent success rate.

That might still seem moderately secure were it not for some realities of the modern online world. The authors point out that many credit card verification services, recognizing the challenges of data entry from illegible forms, may allow up to two digits of the SSN to be wrong, provided the date and place of birth are accurate. They also often allow several failed verification attempts per IP address before blacklisting it. Given these numbers, the authors estimate that even a moderate-sized botnet of 10,000 machines could successfully obtain identity verifications for younger residents of West Virginia at a rate of 47 a minute.
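To see why a two-digit tolerance matters, it helps to count how many nine-digit strings a lenient check would accept for a single true SSN. The sketch below assumes the tolerance means "at most two mismatched positions anywhere in the nine digits"; the article does not spell out the exact rule the services use, and the function name `accepted_strings` is hypothetical.

```python
from math import comb


def accepted_strings(length: int = 9, max_wrong: int = 2) -> int:
    """Count digit strings that differ from a target in at most max_wrong positions."""
    # Pick which k positions are wrong; each wrong position has 9 other digits.
    return sum(comb(length, k) * 9**k for k in range(max_wrong + 1))


exact = accepted_strings(max_wrong=0)     # only the true number matches
tolerant = accepted_strings(max_wrong=2)  # 1 + 9*9 + 36*81 strings match
print(exact, tolerant)
```

Under that assumption, each true SSN is matched by 2,998 different strings instead of one, so a guess that lands merely close to the target has a far better chance of passing.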

All of that requires that the botnet master have access to date and place of birth information, and a number of commercial services will happily provide that data for a price. But the authors also point out that it may not be necessary to pay; they cite a publication in progress that indicates it's easy to harvest a lot of that information from social networking sites like Facebook.

PNAS, 2009. DOI: 10.1073/pnas.0904891106