How researchers enhanced Data.gov using semantic technology

A Rensselaer Polytechnic Institute team adds value to government data sets

A year after launching Data.gov to make government data available to the public, federal officials are still looking for ways to expand its use and the amount of data available on the site. But even the site’s supporters agree that converting much of the data available on Data.gov into usable formats isn’t for the faint of heart.

A three-person team at Rensselaer Polytechnic Institute, however, has demonstrated how one approach can make greater use of the massive sets of data available on Data.gov, using the power of the semantic web. The conversion project has shown how quickly and inexpensively visualization and mash-up applications can be built from government data when it’s put into a web-friendly form.

Data.gov has grown rapidly since its launch a year ago, from 47 data sets to more than 250,000, according to federal Chief Information Officer Vivek Kundra, who spoke at a May 18 event to commemorate the May 21 anniversary.

The Data.gov project at RPI is part of the university’s Tetherless World Research Constellation web technology research initiative, led by Professor James Hendler. Hendler initiated the project in June 2009 after the launch of Data.gov. Hendler "came to me and said that he was looking at the information on Data.gov, and thought it was a great opportunity to use RDF (Resource Description Framework) and linked data," said Li Ding, a research scientist at RPI.

RDF is a standard model for data interchange at the heart of the semantic web. It uses web addresses, known as Uniform Resource Identifiers (URIs), to specify the relationship between pieces of data. Even if the underlying structure of two data sets differs, they can still be linked using RDF.
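To make the idea concrete, here is a minimal sketch of how RDF-style statements can link two differently structured data sets. It models each statement as a plain (subject, predicate, object) tuple rather than using an RDF library, and every URI, property name and value in it is invented for illustration; none comes from Data.gov or the RPI project.

```python
# Base for hypothetical URIs (an assumption, not a real Data.gov namespace).
EX = "http://example.org/"

# Data set A: ozone readings, one statement per monitoring site.
readings = {
    (EX + "site/ABC123", EX + "prop/ozonePpb", "41"),
    (EX + "site/XYZ789", EX + "prop/ozonePpb", "37"),
}

# Data set B: site locations, published separately with a different shape.
locations = {
    (EX + "site/ABC123", EX + "prop/state", "New York"),
    (EX + "site/XYZ789", EX + "prop/state", "Ohio"),
}

# Because both sets name each site with the same URI, merging them is a
# plain set union; no shared table schema is needed.
graph = readings | locations


def describe(subject, triples):
    """Collect every predicate/object pair stated about one subject URI."""
    return {p: o for s, p, o in triples if s == subject}


abc = describe(EX + "site/ABC123", graph)
```

After the union, `abc` holds both the ozone reading and the state for the same site, even though the two facts started out in separate data sets.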

“It’s a great format for being able to structure data for the web,” said Dominic DiFranzo, a first-year doctoral candidate at RPI. “You kind of graft (things) together, make things become linked from one thing to another, and this data format allows us to link concepts, the ideas inside these data sets, to other data sets in a very easy, intuitive fashion.”

The Data.gov data sets offered an interesting opportunity, DiFranzo said, because they were “all free and open to the public, so we had the rights to be able to change and work within it, and link it to other things. It was a great experiment in actually trying to take large data sets that different people from around the US government had curated and taken care of, and trying to mash those together (with data from outside of government) in the open data cloud.”

Ding, DiFranzo and University of Chicago undergraduate Sarah Magidson formed the Data-gov project team and began converting some of the high-value data sets in the Data.gov collection into RDF-enabled demonstrations.

“We're also trying to push the limit on different types of technologies,” said Ding. “A lot of the demos we’ve built (on the Data.gov data) are based on very simple web technologies.” Once the data has been converted from its source to RDF, he said, it can be accessed by applications using a number of standard web technologies, including the SPARQL query language and JavaScript Object Notation (JSON). “In JSON, we actually have a very nice way to let existing web technologies consume this data and do these really cool (applications),” Ding said. Developers can use the development tools they’re already familiar with.
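As a rough illustration of why JSON makes RDF data easy to consume, the sketch below parses a query result with nothing but a standard JSON parser. It assumes the response follows the W3C SPARQL Query Results JSON layout, which the team is not quoted as confirming; the query, the URIs and the sample response are all invented, and no network request is made.

```python
import json

# Hypothetical SPARQL query a demo might send; the predicate URI is invented.
QUERY = """
SELECT ?site ?ozone WHERE {
  ?site <http://example.org/prop/ozonePpb> ?ozone .
}
"""

# Invented sample response in the W3C SPARQL JSON results layout.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["site", "ozone"]},
  "results": {"bindings": [
    {"site": {"type": "uri", "value": "http://example.org/site/ABC123"},
     "ozone": {"type": "literal", "value": "41"}},
    {"site": {"type": "uri", "value": "http://example.org/site/XYZ789"},
     "ozone": {"type": "literal", "value": "37"}}
  ]}
}
"""


def rows_from_sparql_json(text):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    doc = json.loads(text)
    names = doc["head"]["vars"]
    return [
        # A variable may be unbound in a given row, so check before reading.
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]


rows = rows_from_sparql_json(SAMPLE_RESPONSE)
```

Each entry in `rows` is an ordinary dictionary, so it could be handed to whatever charting or mapping tool a developer already uses.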

The result of the work is that it takes developers days, not months, to create applications based on the Data.gov data, DiFranzo said. “We've been able to, in such a small amount of time, have 40 demos up. Anyone can use this technology; they don't have to be a graduate student to make this technology work. I had undergrads who had never seen any sort of semantic tech and were able to pick this up in less than a week. So the technology has gotten to the point where general developers can start building these types of apps very quickly.”

DiFranzo pointed to a demo application built using data from the Environmental Protection Agency’s Clean Air Status and Trends Network. “The CASTNET project has sensors all across the US measuring ground ozone and other pollutants,” he said. The data set had the readings from all the sites by name, said DiFranzo, “but there were no inclusions in the data of where the stations were located. So it sort of rendered the data meaningless.”



Once the data was rendered into RDF, DiFranzo said, “we were able to find through data outside Data.gov, a data set that describes where every site actually lives in the US. And we were able to link these two concepts together, and then we were able to do applications with a map actually displaying where the sites were, showing the aggregate values for ozone in areas. We were able to mash up with their own website, and other data we had on the history of those values throughout a number of years.” DiFranzo and Ding were also able to pull in data from other EPA sites about the sensor sites. The resulting demo application can be seen at http://data-gov.tw.rpi.edu/demo/exhibit/demo-8-castnet.php.
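The mash-up DiFranzo describes amounts to joining readings to locations by site and then aggregating by area. The sketch below shows that step in plain Python under stated assumptions: the site IDs, coordinates and ozone values are invented, and the function name `average_ozone_by_state` is hypothetical. It is not the team's code.

```python
from collections import defaultdict

# Invented readings: (site ID, ozone value). No location is included,
# mirroring the gap DiFranzo describes in the original data set.
readings = [
    ("ABC123", 40.0),
    ("DEF456", 44.0),
    ("XYZ789", 30.0),
    ("NOLOC1", 50.0),  # a site missing from the location data
]

# Invented location data from a second, outside source.
site_locations = {
    "ABC123": {"state": "NY", "lat": 42.7, "lon": -73.7},
    "DEF456": {"state": "NY", "lat": 43.0, "lon": -76.1},
    "XYZ789": {"state": "OH", "lat": 40.0, "lon": -83.0},
}


def average_ozone_by_state(readings, site_locations):
    """Join readings to locations by site ID, then average per state."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [sum, count]
    for site_id, ozone in readings:
        location = site_locations.get(site_id)
        if location is None:
            continue  # a reading with no known location cannot be mapped
        bucket = totals[location["state"]]
        bucket[0] += ozone
        bucket[1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


averages = average_ozone_by_state(readings, site_locations)
```

Readings with no matching location are skipped rather than guessed at, which is the same reason the original data set was of little use until the location data was linked in.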

The EPA has its own visual interface into the CASTNET data. “However, we find that if the government exposed their raw data outside,” said Ding, “we can do an even better job, because the whole point is that once we have the data, we are no longer limited by the visual data access interface. I want to see more on the map, but unfortunately the visualization was restricted by the government data set. We have to pull from several web pages.”

DiFranzo said the big takeaway from the research work is how much money and time could be saved on visualization projects and other development programs if more of the data were exposed in this fashion.

“Before, if you wanted to make a visualization like CASTNET, you hired some contractor. They would spend a lot of time to determine the correct model for the database, and build this really high-end visualization. It would look really cool, and work really fast, and it would be awesome, but it would also cost a lot of money and take a couple of months. With these visualizations we do, since the data is in RDF, we're able to use off-the-shelf visualization technologies like the Google Visualization API, and in a matter of days we can make these quick mashed-up data visualizations and applications. It's just a more rapid development cycle.”



