Rufus Pollock, co-founder of the Open Knowledge Foundation and a Cambridge University economist, talks about the importance of making data and information open and useful to the world.

Jed Sundwall: What is the Open Knowledge Foundation?

Rufus Pollock: We were founded in 2004. At the time, things were less developed than they are now and we had a simple purpose: to promote open knowledge, open information. We used the term knowledge because the aim was to go beyond software. We wanted to open stuff that wasn't code. Of course, the distinction between code and other kinds of information is not always a very sharp one, but we felt there was a lot that you could take from the experience of the free and open source software communities and port almost directly to other areas, be that science, be that economics, be that geodata, etc.

The foundation itself is there to promote open knowledge, promote means of opening knowledge, and tell people what it is for knowledge to be open and why it's a good thing. The foundation runs events, builds tools, facilitates communities, etc.

I'm one of the directors of the foundation and I also helped found it. We're fairly open in how we run the foundation. It's pretty peer-based, so people are welcome to come in and start projects. They can say, "Well, I want to do this kind of project," and if it fits within the overall purpose and they know what they're doing, then they can go ahead and start working. In that sense it's a fairly loose governance structure, more or less modeled on the Apache Software Foundation. There again, they have a kind of core but they also have a structure where people come along and run projects within the organization but fairly autonomously.

I imagine the general consensus is that as much stuff as possible should be open, especially when you're talking about knowledge and information, educational tools. Do you come up against a lot of people fighting that idea?

Of course. You might think that openness, particularly among people in government, let's say, would be a consensus position. But this is not the case. One big problem, I think, is that in general, but among policy-makers especially, there is a big tendency to equate price and value, or income and welfare, to use the more technical terminology. This immediately places open approaches at a big disadvantage because there tends to be less immediately measurable 'income', even though the value generated overall for society may be higher.

Could it be that some policy makers are in bed with a lot of people who make money off of this kind of information?

Possibly, but I don't think it's that simple. It's often more about narrow-mindedness or just not being open to alternative ways of doing things.

Would you mind sharing with me any examples of particular successes that the foundation has enjoyed and/or particular projects that you've produced?

One of the early things we did was define what we mean by openness. It might seem minor but it's a big issue. It's important because by having a good definition of openness we are ensuring we have a real commons of information, a real commons of knowledge with all of the benefits of reuse that implies.

It's particularly important because there has been a fair amount of debate and I think, after that debate, it's still quite muddy. To take the most obvious example, take a look at Creative Commons. Often people chat and say, "My stuff is Creative Commons." But that doesn't mean a lot, in the sense that there are several Creative Commons licenses, some of which are mutually incompatible and some of which, the non-commercial ones especially, are definitely not open in the sense of 'open' in open software.

So one big thing we have done is develop a standard, the Open Knowledge Definition. This takes the principles from free/open software and applies them to information, data, knowledge, etc. This is important because we don't currently have a clear sense of what openness means in these areas. And, more importantly, we're advocating for a standard that will allow people to communicate and share. Our hope is that we can plug open material together with other open material, knowing that the different sources of material all share the same freedoms. Currently, it can be quite costly to put together lots of different material because we need to sort through the different licenses protecting everything.

Another thing we did, which is more a tool or piece of infrastructure, is the Comprehensive Knowledge Archive Network (CKAN), which I think you mentioned earlier. It's one step, but we think an important one, on the road to packaging knowledge and making it truly reusable. What do I mean by packaging here and why is it important?

Well, one day soon we're going to have lots of material that is open, and what's really exciting about open stuff is that it can easily be shared and recombined. That means we can break very complicated problems down into small bits, which people can manage. But then, we can put it back together again. So, let's say you were interested in U.S. unemployment, a hot topic, and you're interested in understanding how it changes. Maybe there's a data site out there just on unemployment itself. But maybe there's another one on house repossessions or the housing market, and then, there's another one on manufacturing. There are a whole bunch of different data sites.

Now, maybe one person could just maintain them all but that might become too big a job. You may need expertise in the housing market to maintain the housing data site, but you really want to bring these together often when you want to do analysis, or compute things, or make pretty pictures, or whatever it is you want to do. This is very similar to building a large building, let's say, or developing an operating system plus all the applications to use. Maybe one person could build them all and make sure they all work together but that would be quite a big task. Even the world's greatest monopolist struggles to do this effectively.

So, the typical way we go about doing this is by exploiting divide and conquer. But when you divide stuff up, there is this question of how you bring it back together. So then, we say we're moving toward a world where you can start getting lots of these data sets and then start putting them out there in the world. People can just start taking this unemployment data or this housing data. But how do you find that and how do you get a hold of it? So often in software, there's been this tradition of building some kind of registry where you can find things, and then you start to impose some structure on that material, you start packaging. So rather than just saying, "Here's my website, here's my wiki, look, there's lots of data on it," you are going to start packaging that data in a slightly more structured form.

The point of CKAN is to start saying, look, there's a better way than just having our stuff in wikis or in some random form on a website. We can start registering this material, and packaging it up a bit. That way other people, when they want it, can come and get hold of it easily and the wheel of reuse can start to turn.
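As an editorial illustration of the registry idea described above, here is a minimal sketch in Python. It is not CKAN's actual data model or API: the `Package` and `Registry` classes, their field names, and the tags are all assumptions made up for this example. Only the census package name and its page URL come from the interview.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Package:
    """A hypothetical registry entry: metadata about a data set, not the data itself."""
    name: str                            # short unique identifier
    title: str                           # human-readable title
    url: str                             # where the material is described
    download_url: Optional[str] = None   # where the whole data set can be fetched, if anywhere
    license: Optional[str] = None        # None means "license not specified"
    tags: List[str] = field(default_factory=list)


class Registry:
    """A hypothetical in-memory registry that records packages and finds them by tag."""

    def __init__(self) -> None:
        self._packages: Dict[str, Package] = {}

    def register(self, package: Package) -> None:
        # Refuse duplicates so that a name always refers to exactly one package.
        if package.name in self._packages:
            raise ValueError(f"package already registered: {package.name}")
        self._packages[package.name] = package

    def find_by_tag(self, tag: str) -> List[Package]:
        return [p for p in self._packages.values() if tag in p.tags]

    def without_download(self) -> List[Package]:
        # Packages that are listed but give no way to get all the data at once.
        return [p for p in self._packages.values() if p.download_url is None]


registry = Registry()
registry.register(
    Package(
        name="2000-us-census-rdf",
        title="2000 U.S. Census in RDF",
        url="http://www.ckan.net/package/read/2000-us-census-rdf",
        tags=["census", "rdf"],  # illustrative tags, not taken from the real entry
    )
)
census_packages = registry.find_by_tag("census")
```

The registry stores only descriptions and pointers, which mirrors the distinction drawn later in the interview between a registry and a storage service.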

Well, the thing that's always been a mystery to me when working with developers who build APIs is: how difficult is it to make one of these sets? It seems to me like it would be a ridiculous task. I'm sure they have ways to do it but, for instance, I'm looking on here and there's U.S. Census data but it's been put into RDF. So I'm presuming they just took this raw data somehow and wrote a script to put it into RDF.

Yeah, they did. One billion RDF triples: http://www.ckan.net/package/read/2000-us-census-rdf

So, is this something that just any savvy engineer can do in their spare time when they feel like it?

Yes. Let's say you start with people who write programs, and they write the programs on their own, and the programs work. Gradually, people start to grab other people's programs and use them in theirs, at which point those other people's programs need to start being a bit standardized. For example, they need documentation, an API, etc.

The point about the 2000 U.S. Census RDF is that there's a website out there where you can look at the census data. You could just go look it up, but if you actually want to do some analysis, you'll want the data in a form that's more usable and RDF is a good web standard format which can represent complex and rich data but in a standard way.

This is just the beginning. It's fine to have these data sets, but what really gets complicated is dependency tracking. For example, a guy produces census data, and you say, OK, I've produced another set of data, but this time, I've changed the format because I've added in another layer of information. Then, it turns out that someone's written code that works on the old data but not the new data, and someone needs to reconcile this. How do we track these kinds of dependencies? Without some system of tracking changes to data, divide and conquer grinds to a halt. One person or one organization has traditionally had to deal with all of this tracking, but that hasn't been very effective as a development method.
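To make the dependency problem concrete, here is a rough editorial sketch in Python of one way such tracking could work. Everything in it is an assumption for illustration: the `Release` and `Dependent` types, the idea of an integer `format_version`, and the names in the example are not taken from CKAN or from any system mentioned in the interview.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Release:
    """A hypothetical published release of a data package."""
    package: str
    version: str
    format_version: int  # bumped whenever the layout of the data changes


@dataclass(frozen=True)
class Dependent:
    """Hypothetical code or data that consumes a package in particular formats."""
    name: str
    package: str
    supported_formats: FrozenSet[int]


def broken_by(release: Release, dependents: List[Dependent]) -> List[str]:
    """Return the names of dependents that cannot read the given release."""
    return [
        d.name
        for d in dependents
        if d.package == release.package
        and release.format_version not in d.supported_formats
    ]


dependents = [
    Dependent("old-analysis-script", "census", frozenset({1})),
    Dependent("new-analysis-script", "census", frozenset({1, 2})),
    Dependent("housing-charts", "housing", frozenset({1})),
]

# A new census release changes the format by adding a layer of information.
new_release = Release(package="census", version="2.0", format_version=2)
needs_attention = broken_by(new_release, dependents)
```

With the dependencies recorded, publishing a release that changes the format immediately shows which consumers need to be reconciled, instead of leaving that to be discovered when something breaks.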

Right, and the idea is that it is a principle of open source software that people will create and manage data sets because they need to use them.

Absolutely.

I guess what you're encouraging is that they house them in your repository, or at least somehow centralize them?

Well, one thing we don't support is actually housing all the data in our repository, because raw data, compared to software, is a lot bigger. While we have some gigs, or maybe we're getting on for a terabyte, there's just way more data than that. So we encourage people who need to store data to stick it on archive.org or somewhere similar, with CKAN acting as a registry.

So the point is that it's more of a registry than a storage service. The CKAN software is obviously open software, and it would actually be really great if other people were to start and say, "Well, we'll host a CKAN and we'll federate some of this data so we share it back and forth."

It's a great service, and one that I've wished governments would start investing in. We'd save a lot of tax dollars if we simply allowed private citizens access to raw data and let them build useful websites or applications with it.

One thing I've always found odd in life is what other people seem to find interesting or not. We sometimes talk about something we call "Shiny Front End syndrome," where people put all the effort into building the website front-end rather than realizing that what's really primary is the data behind the website. (We blogged about this in November 2007 in a post titled "Give Us the Data Raw, and Give it to Us Now.") In such cases the data is made secondary and what's primary is their particular website representation of it. A lot more money is actually spent on the website than on actually structuring the data, or even creating a "read me" or a license file for the data and making it available to download.

Going back to CKAN, only about one in five packages currently has a download URL. We have listed, let's say, 250 that are on sites or even data collections, but about 80% of those don't actually give you a way to get all the data at once in a convenient manner. It might be that you have to go to some web interface where you can type in your query or whatever it is.

And then, of course, this also applies to non-profits or, like, research organizations.

They have it bad, I mean, they have it really bad. Talk about science: most science people don't give out the data. They give out the research results, i.e., the summary of the data. There's a huge incentive to do that: "A," it's actually secrecy, you don't want to give out your data because you can keep using it, and "B," it's much harder to check whether you've done some weird stuff with the stats to get your nice results if your data isn't actually available there, right? So there are large incentives to not put it out there, but it's terrible for non-profit research organizations.

But we also need to remember that most people aren't geeks. Ultimately they read research reports and they have no interest in interacting with raw data. They don't see the benefit of raw data. One thing I actually emphasize is don't get too technical with people. Just say, it has to be in the form of X, just say any kind of dump, an Excel spreadsheet, anything is good as long as it's not a PDF.

Exactly, because as I said earlier, any savvy developer, if you give them something that's even moderately structured, give them an Excel spreadsheet, they'll be able to figure out how to parse it and put it into something they can use.
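As a small editorial illustration of that point, here is a sketch in Python of parsing a moderately structured dump, assuming the spreadsheet has been exported as CSV. The column names (`region`, `year`, `value`) and the rows are made-up placeholders, not real data, and `parse_dump` is a hypothetical helper written for this example.

```python
import csv
import io
from typing import Dict, List, Union

# Placeholder contents of a spreadsheet exported as CSV (made-up values).
RAW_DUMP = """region,year,value
A,2008,1.5
B,2008,2.5
"""


def parse_dump(text: str) -> List[Dict[str, Union[str, int, float]]]:
    """Turn CSV text with a header row into a list of typed records."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "region": record["region"],
                "year": int(record["year"]),      # convert from text to a number
                "value": float(record["value"]),
            }
        )
    return rows


rows = parse_dump(RAW_DUMP)
```

The same few lines would not work on a PDF, which is why even a plain dump is so much more useful to someone who wants to reuse the data.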

Absolutely. And again, it's worth reemphasizing that people need to release their data with a license (hopefully an open license) and a "read me" explaining the data. Ideally this will become as much second nature as releasing a research report: something that would just come along with the report.

Again, if you just look on CKAN, there are so many packages that are tagged "license not specified." Data has intellectual property rights in many parts of the world, just like copyright. It's a really big thing to put a license on. In some cases, we can say this is the U.S. government so it's probably open, public domain stuff, but it's best to have a license on everything so we can be sure.

To take an extreme example, the Library of Congress obviously has a lot of bibliographic data. We've been very interested in that for a while. But apparently, they often try and charge for that data outside the U.S. because they say we have to give it out freely here within the U.S. but that doesn't mean that we don't have copyrights or database rights outside of the United States. So it's really crucial that you get licenses on data and make it very clear what the situation is.

It's immensely difficult because not everybody's a lawyer, and even people who are lawyers and deal with licensing and intellectual property laws have a hard time with this stuff because it's completely Byzantine. I had never even considered that public domain in the US might not apply internationally.

I understand that, but in some sense it's very simple, in that people don't have to understand all that complexity if all they want to say is: you're free to do what you want with this. We've also made an effort to make this simpler for people by creating a list of appropriate licenses etc. So to repeat, it is as simple as: "You need a license and here are the licenses that you should use." So no one really needs to go away and worry. They can just take one of the licenses and copy the text. So, in that sense, you don't need to get involved with all the complexity unless you want to.

It is also important not to get scared of the complexity and leave it for someone else to sort out. People get scared, throw up their hands, and give up. That's very bad because the default, sadly, is a proprietary arrangement rather than an open one.

Lastly, I think I should say that this is especially important because of other changes. In particular, one of the interesting things I think that's happening at the moment is that people are moving from applications to services in a big way. One of the weird things is that in doing so, they're moving back to more proprietary models: while a lot of software is now open source, many services are closed and proprietary. Even if they're free to use, they're proprietary services in the sense that portions of their code are proprietary and the data is proprietary. Everyone can use Google Maps for free, but the data has copyright on it and, basically, we're getting it free because of Google's generosity. If at any point they decide not to give it away, it's gone.

So I think in that sense, it's interesting times in that there's both a lot of openness, but there's also, kind of, this movement back, in some sense, the other way.

But, you know, here's the economist in me, all that really matters is the utility that everybody else is getting out of it. The utility's definitely increased but, you're right, there is sort of this threat hanging over it.

The question, obviously, that I would ask is this: most information goods, be it software or online services, have always been under a model of cheap now, dear later, right? When Microsoft launched its original product, it was probably a twentieth or less of the cost of a new machine. But now, a Windows license could be equal to 50% of the cost of the laptop.

Software developers want to get all of their users in at the beginning and they're willing to do that for free. The question is whether there is one day some kind of reckoning regarding that. If it's true that all your customers could just switch for zero cost to anybody else, you're not going to make that much money because there's only so much rent you can extract from them. I mean, OK, you can show them adverts, but if your adverts get too intrusive or annoying, people will switch elsewhere. So there is a question you might ask which is, given the valuations of these companies, investors must be anticipating that people can't switch quite as easily as they might imagine. So, did you actually try moving all your stuff from one place to another, etc.? Anyhow, it is an interesting question for the future.

Could you make three suggestions for what people can do to improve things?

First: if you're getting data together or material together, please license it and please license it in an open manner wherever you can. Of course, there are some situations where maybe you can't. Maybe you've got to sell it or there's some licensing deal with the people who gave you the data or whatever. But wherever you can, license it and license it openly.

Second: give out raw material. Give out raw data. Don't be scared about doing that, and don't start getting too worried about the tech stuff: do we need RDF, do we need it in Open Office versus Word, and all of that. Just give it out in the simplest way possible for you to start with. (Maybe it will turn out to be useless to people, in which case it's good that you spent no effort on it. And if it does turn out to be useful and lots of people want it, that will be the motivation to put it in some more useful format or, even better, someone else will do that for you for free.)

Third: please come along to CKAN. Anyone can come along to CKAN and register a source of data; you don't even need to sign up. You can say, I know about this set of material X and here is its URL and here are some tags. Anyone can come and do that. And even better, if you want to get more involved, come and be a maintainer and turn some data into a really usable package for other people.