A new open data initiative from the MIT Media Lab, Datawheel, and Deloitte offers CIOs a model for combining and representing disparate data sets.

Sometimes it takes a team of seemingly unlikely collaborators to make progress where others have struggled, and that’s exactly what a group of scruffy-haired software developers and freshly pressed consultants did with DataUSA. They created a website that aggregates data from multiple U.S. government agencies and makes it available to the public in an interactive, highly visual, and user-friendly format.

A collaboration between the MIT Media Lab, its Datawheel spinoff, and Deloitte, DataUSA is a web-based platform that brings together data sets from the U.S. Census Bureau, the Department of Labor, the Bureau of Economic Analysis, the Department of Education, and other sources—making it the largest, most comprehensive website and visualization engine for U.S. government data. It presents information on more than 36,000 locations, 300 industries, 500 occupations, and 2,300 college majors.

Built on open source software, DataUSA aims to make accessing and querying government data easier for the many individuals and organizations that use it. Businesses widely rely on government data to inform strategic real estate, talent, marketing, and other decisions. But this data has rarely been easy to find or manipulate, according to César A. Hidalgo, a professor at the MIT Media Lab and director of its Macro Connections group. Corporate economists, forecasters, and researchers have had to search multiple sources and websites for it, and have sometimes ended up with incomplete or poorly formatted data sets that require specialized analysis tools, Hidalgo adds.

DataUSA can help CIOs answer a variety of talent-related questions, such as: Where should I consider locating a new IT operations center? From what colleges and universities should I recruit IT staff? What will demand for information security analysts look like in 10 years? In the past, answering these questions may have required months of research, but with DataUSA, CIOs can gain insights in minutes with just a few mouse clicks or screen taps.

Beyond using DataUSA as an information resource, public and private sector CIOs can take its source code to deploy the platform inside their organizations and plug their enterprise data into it. They can also take advantage of DataUSA’s APIs to develop new applications or integrate its data sets with their organizations’ existing analytics tools.
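As a rough illustration of what plugging enterprise data into public data sets can look like, the sketch below joins a JSON payload to internal records by location name. Everything in it is an assumption made for illustration: the payload shape, the field names (`geo`, `median_wage`, `site`, `it_staff`), the `join_on_location` helper, and the placeholder numbers are hypothetical and are not taken from DataUSA's actual API or data.

```python
import json

# Hypothetical JSON payload. The field names and values are placeholders
# and are NOT taken from DataUSA's actual API response format.
PUBLIC_PAYLOAD = json.loads("""
{"data": [
  {"geo": "Austin", "median_wage": 98000},
  {"geo": "Raleigh", "median_wage": 91000}
]}
""")

# Hypothetical internal records held by the enterprise.
INTERNAL_HEADCOUNT = [
    {"site": "Austin", "it_staff": 40},
    {"site": "Denver", "it_staff": 15},
]


def join_on_location(public_rows, internal_rows):
    """Left-join internal rows to public rows by location name.

    Internal rows with no public match keep a median_wage of None,
    so gaps in the public data stay visible instead of being dropped.
    """
    wages = {row["geo"]: row["median_wage"] for row in public_rows}
    return [
        {**row, "median_wage": wages.get(row["site"])}
        for row in internal_rows
    ]


merged = join_on_location(PUBLIC_PAYLOAD["data"], INTERNAL_HEADCOUNT)
for row in merged:
    print(row)
```

A left join is used here so that every internal site survives the merge, which makes it easy to see where the public data set has no matching location.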

“Other open data initiatives have struggled to write APIs that developers could easily use to create new applications, so we put a lot of thought into our API design,” says Matt Gentile, a principal and analytics leader with Deloitte Transactions and Business Analytics LLP, and a co-lead sponsor of the project.

The project team also devoted considerable thought to the user experience, which further differentiates DataUSA from other public- and private-sector open data initiatives. The team solicited feedback from economists, data scientists, policy-makers, executives, and individual citizens, and sought to present data in a way that tells a story—whether about a location, industry, or occupation. “We wanted to create a platform that predigests data for users by giving it context and comparing it with other data sets,” says Hidalgo.

DataUSA is built on the PostgreSQL object-relational database system. The development team built a “data ingestion library” that imports, stores, and indexes data, converts it into APIs, and greatly facilitates the onerous cleansing process. The core logic layer interprets users’ queries, presents relevant data, and offers alternate data sets if the system lacks data to answer a specific query. The D3plus JavaScript library creates more than 1 million visualizations based on included data sets. Additional tools transform data into the text descriptions and captions that appear across the site.
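The ingestion and fallback behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DataUSA's actual code: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the table layout, the `clean`, `ingest`, and `lookup` helpers, the `PARENT_GEO` mapping, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source file:
# inconsistent whitespace and casing, thousands separators, a missing value.
RAW_ROWS = [
    {"geo": " boston ", "occupation": "Information Security Analysts", "employed": "1,200"},
    {"geo": "Boston", "occupation": "Web Developers", "employed": "n/a"},
    {"geo": "Massachusetts", "occupation": "Web Developers", "employed": "9,500"},
]

# Hypothetical parent lookup used to find an alternate data set.
PARENT_GEO = {"Boston": "Massachusetts"}


def clean(row):
    """Normalize one raw row; return None if the value cannot be parsed."""
    try:
        employed = int(row["employed"].replace(",", ""))
    except ValueError:
        return None
    return (row["geo"].strip().title(), row["occupation"].strip(), employed)


def ingest(conn, rows):
    """Create the table and index, then store every row that survives cleaning."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employment "
        "(geo TEXT, occupation TEXT, employed INTEGER)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_geo_occ ON employment (geo, occupation)"
    )
    cleaned = [r for r in map(clean, rows) if r is not None]
    conn.executemany("INSERT INTO employment VALUES (?, ?, ?)", cleaned)
    return len(cleaned)


def lookup(conn, geo, occupation):
    """Return (geo_used, employed), walking up to a parent geography
    when the requested one has no data; return None if nothing matches."""
    while geo is not None:
        hit = conn.execute(
            "SELECT employed FROM employment WHERE geo = ? AND occupation = ?",
            (geo, occupation),
        ).fetchone()
        if hit is not None:
            return geo, hit[0]
        geo = PARENT_GEO.get(geo)
    return None


conn = sqlite3.connect(":memory:")
stored = ingest(conn, RAW_ROWS)
print(stored)
print(lookup(conn, "Boston", "Information Security Analysts"))
print(lookup(conn, "Boston", "Web Developers"))
```

In this sketch the unparseable Boston row is dropped during cleaning, so the second lookup falls back to the state-level figure. Falling back to a parent geography is only one possible way to offer an alternate data set; the article does not say how DataUSA chooses its alternates.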

To keep the data timely, the DataUSA team aims to update the site with the latest release of each component data set. “Most agencies contributing data to the project put a lot of effort into normalizing and validating it,” Gentile explains. Agencies including the Bureau of Labor Statistics, for example, frequently release revised numbers. “We appreciate the complexity associated with preparing these data sets, and we’re committed to importing them as soon as they’re available.”

Hidalgo and Gentile hope public and private sector organizations will contribute some of their data to the platform. Since DataUSA launched in April, they’ve received a lot of feedback, which they are now trying to incorporate into the next version. “We want DataUSA to become the go-to place for accessing and building on U.S. public data,” says Hidalgo. “We have high hopes for using open government data to address social issues and promote economic growth.”