While beta testing Talis's Kasabi, I got to wondering about the data publishing market: who out there is hosting raw data, potentially charging for it and passing money along to the data's providers? Poking around, I learned who the key names are. (Corrections welcome.) I stumbled across a few more when I followed a tweet from @xmlgrrl (a.k.a. Eve Maler, a friend of mine in the XML world since it was the SGML world) and started looking at her husband Eli's blog. His posting "Ten services to get your cloud startup off the ground now" mentioned a few more companies that provide raw data, including one that even provides free RDF. I tagged a few with a delicious.com bookmark, but wanted to write out notes about a few here in order of how interesting they are to a semantic web geek.

Some general notes:

The more I studied, the more I found, but I didn't want to spend more than an afternoon on this.

These sites all let you download data directly. I didn't include sites like Data.gov that function more as directories that link to data sources on other sites.

Most of these providers have boosted their numbers of available datasets by including small datasets with as few as 100 records, and by hosting copies of data from the well-known names in the Linked Data Cloud. The advertised added value is typically the ease of programmatic access to that data.

Despite the title of this blog entry (I was tempted to call it "Data resellers", but many make the data available for free) I focused on a narrower case of data providers: the redistributors that gather data from specific, identified places and then make it available publicly with attribution, rather than the actual data sources themselves, such as government agencies, university projects, media making their metadata available, and various other circles on the Linked Data Cloud diagram.

If I've quoted some companies' websites more than others, it's because they had "About" and "FAQ" pages that were easy to find and actually answered the questions I was wondering about.

The most interesting thing about Kasabi in this field is their commitment to providing data according to Linked Data principles, giving you SPARQL endpoints for data sources and the ability to define new APIs around each data source. The current data selection is interesting, considering that Kasabi is still in beta. For now it all looks like data that is freely available elsewhere, but the advantages of retrieving it from them go beyond the ability to use the SPARQL query language. For example, with BestBuy's RDFa spread out across many different dynamically generated pages on bestbuy.com, querying this data from BestBuy's server has a lot of limitations. Kasabi seems to have the BestBuy data aggregated so that their customers have more flexibility in how they query it.
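For anyone who hasn't queried a SPARQL endpoint over HTTP before, here is a minimal Python sketch of what that looks like. It assumes an endpoint that follows the standard SPARQL Protocol (query passed in a `query` parameter, results requested as SPARQL JSON). The endpoint URL and the query are placeholders of my own, not Kasabi's actual API, which may well require an API key or other parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint URL, for illustration only; substitute a real one.
ENDPOINT = "https://example.org/sparql"

# A generic query: ten resources and their rdfs:label values.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    bindings = payload["results"]["bindings"]
    return [{var: cell["value"] for var, cell in b.items()} for b in bindings]


def run_query(endpoint, query):
    """Send the query and return the flattened rows (needs network access)."""
    with urlopen(build_request(endpoint, query)) as resp:
        return rows_from_results(json.load(resp))

# Usage, once ENDPOINT points at a real service:
#   for row in run_query(ENDPOINT, QUERY):
#       print(row["s"], row["label"])
```

The point of an aggregated endpoint is that one query like this can run against the whole dataset at once, instead of crawling RDFa from many separate pages.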


I list Socrata right after Kasabi because RDF is one of their export formats, along with XML, JSON, CSV, XLS, and more. In a business that depends on finding both data providers and data users, their home page makes the clearest case about why someone should work with them as a data provider: they're clearly targeting government agencies who need to fulfill data transparency mandates. (Other providers are certainly targeting this market; just not as clearly.) The company info page calls them "The Leader in Open Data Services for Government". Another paragraph on the homepage makes a nice case for why developers should be interested in their data, and upcoming webinar titles of "Launch your own Data.Gov" and "Open Data as a Service Delivery Platform" are also pretty catchy to someone interested in this market.

Factual targets data users more than data providers on their current home page, telling developers "Access great data for your web and mobile apps". The only download format I could find was CSV, but with their emphasis on helping developers build apps, they focus more on data delivery through their RESTful API. According to their FAQ, "Factual, Inc. is an open data platform for application developers that leverages large scale aggregation and community exchange... Factual's hosted data comes from our community of users, developers and partners, and from our powerful data mining tools... Factual offers several hundred thousand datasets across a variety of topics (with a deep focus in Local) aggregated from multiple sources, made easily accessible for developers to build web and mobile apps... Our APIs are free to everyone—if you want SLAs or have certain performance requirements, we would charge you a fee based on usage volume. Our downloads are free for smaller developers". A press release on Semantifi's web site shows that some big names and big money are behind Factual.

Infochimps seems to be one of the better-known (and more memorable) names in the field. From their FAQ: "Infochimps is a place for people to find, share and sell formatted data. Both users and Infochimps employees scrape, parse and format data so that it's easily accessible to you. We take the chimp work out of working with data so you can literally start building cool stuff in minutes... There is no sign up fee to use Infochimps. Some of the data sets available on our site are free. Some require attribution, and others are available for purchase. The first 100,000 data API calls are free. We offer subscriptions if you would like to use more... The data sets available through our API are 1.) hosted for you and 2.) scraped on a regular basis. ... Most of our data comes in tsv, csv or yaml format". The part about users scraping, parsing, and formatting highlights another aspect of the business model of some of these companies: crowd-sourcing the labor whenever possible.

AggData sells CSV files, typically of locations of all the stores in a particular chain. For example, a complete list of Cinnabon locations, with 454 records, costs $29. The description page for each data set lists the fields and lets you download a sample. Prices that I saw ranged from $9 to $49. According to their FAQ, you order a dataset, and when payment is confirmed they email you a URL for the data that is good for 5 downloads or 120 hours. Founded in 2006, which makes it the oldest of these companies, AggData is also the most low-tech (no APIs here), but it's a lot easier to look at their lists of franchise locations and churches and imagine that data being useful to someone than it is for many of the other data providers. Infochimps lists AggData as a "featured data provider", but lists the same prices for the same datasets, so I'm not sure whether they're just routing you to the same batches of data or making it available through their own APIs. (I got an Infochimps ID, clicked through for an AggData dataset until it asked me for credit card information, and stopped there.)
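Low-tech has its advantages: a CSV file of store locations needs nothing beyond a standard library to be useful. As a rough sketch, here is how such a file might be loaded and summarized in Python. The column names and sample rows below are made up for illustration; a real file would use whatever fields its description page lists.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for a downloaded file;
# the column names are hypothetical.
SAMPLE = """store_name,address,city,state,zip
Example Bakery #1,100 Main St,Seattle,WA,98101
Example Bakery #2,200 Pine St,Spokane,WA,99201
Example Bakery #3,300 Oak Ave,Portland,OR,97201
"""


def load_locations(fileobj):
    """Read a CSV with a header row into a list of dicts, one per location."""
    return list(csv.DictReader(fileobj))


def count_by(rows, field):
    """Count how many rows share each value of the given column."""
    return Counter(row[field] for row in rows)


rows = load_locations(io.StringIO(SAMPLE))
for state, n in sorted(count_by(rows, "state").items()):
    print(state, n)
```

With a real download you would pass `open("locations.csv", newline="")` (a hypothetical filename) instead of the `io.StringIO` wrapper.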

According to their About page, Semantifi "developed a meaning based search platform to search both structured and unstructured content and filed multiple patents". Along with the platform, they say that they have an "App Store like marketplace for a community of publishers to build data search apps" and that "Both Socrata and Factual are quite similar in concept and both lack the technology to search datasets like Semantifi". As far as I could tell, Socrata and Factual have a lot more datasets than Semantifi; the first three Semantifi links that I clicked to look into specific data sets went to an empty wiki page. (If I was clicking in the wrong place, that's not a great reflection on their site design. Also, with all of the people with hardcore financial markets experience on Semantifi's management page, why do they need Google ads on their home page?) Perhaps Semantifi is less like data providers Socrata and Factual than they think and more like Open Data Directory, which doesn't provide actual data but instead a search engine for data spread out across other sites that they index.

I wanted to mention one other interesting source of fairly large-scale data to use in applications. When I learned how to add a volume for more disk space to an Amazon EC2 cloud image, I found that some of the volumes I could choose from included data from a choice of public data sets: DBpedia and Freebase dumps, the Enron email, US Census, Labor, and economic data, various biological data collections, and more. There is a list of such data on Amazon's website, but it doesn't show all the choices; additional data sets include BBC Music and programs data. If you were going to jump into the data reseller market with the various companies described above, an Amazon image with some of this data would be one logical place to start your company.

A local friend, Eric Pugh, was recently pointing out to me the irony of how, while disintermediation was a big buzzword of the dot com boom, intermediation is now getting bigger. These data resellers are a good example. If you're going to insert yourself as a middleman between a data provider and a data user, having a lot of customers on one side makes a compelling case for the other side to use your service, but before you get there, you need to make your own compelling case to each side. Some of the companies listed above are better at doing this than others, and it will be interesting to see which of them are in business in five years and why they lasted.