Panel debate, ESWC 2013, Montpellier

This was the theme for ESWC 2013, so it's clearly a subject on people's minds. There was even a panel debate about it, at which my thoughts on the subject turned out to be very different from those of the panel speakers. So I decided to do a little write-up of my thoughts to add to the conversation.

The first problem here is that the term "Big Data" is used in two different ways, as I've pointed out before. First, there is the "too much data to process by conventional means" meaning. Second, there is the "analyzing data with Big Data techniques" meaning, which really boils down to doing data science with machine learning.

As far as truly big data sets go, I don't really think semantic technologies have that much to offer. For one thing, existing solutions don't really scale to bazillions of triples. For another, really, really big data sets are generally pretty simple in structure, without much in the way of useful semantics. Triple stores do offer flexible, schemaless storage, but so do the NoSQL stores already in use in this area. As for reasoning on instance data, that's not feasible at these scales yet.
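To make the "schemaless" point concrete, here's a minimal Python sketch of what a triple store boils down to: every fact is just a (subject, predicate, object) tuple, so a brand-new property can be attached to anything without a schema migration. All the identifiers here are invented for illustration.

```python
# Minimal sketch of a schemaless triple store: facts are plain
# (subject, predicate, object) tuples, queried by pattern matching.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching a pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = TripleStore()
store.add("ex:alice", "ex:worksFor", "ex:acme")
store.add("ex:alice", "ex:knows", "ex:bob")
# No schema change needed to record a property nobody planned for:
store.add("ex:acme", "ex:foundedIn", 1999)

print(store.match(s="ex:alice"))  # everything we know about alice
```

Of course, the NoSQL stores mentioned above offer essentially the same flexibility, which is exactly the point: schemalessness alone is not a differentiator at this scale.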

However, that is not to say that there are no connections. Reading research papers and seeing the work presented at ESWC 2013, I see machine learning techniques used in quite a lot of them. Very often, it's used to infer new, clean data from noisy input data, or to link data sets together. So there is some overlap with machine learning, at least.

Place de la Comédie, Montpellier

As we've seen with IBM's Watson, the data sets produced as part of Linked Open Data can usefully be deployed in deep learning and machine learning contexts. There hasn't been very much of that as yet, but as more and more general, museum/library, and scientific data gets published, this is likely to increase. I've already seen business cases for reusing this type of data in a publishing context, and there's likely to be many more such examples. Semantic technology use may well follow the use of the data sets into these areas.

When it comes to data science in practice, a key problem emphasized everywhere is finding usable data sets, whether in the organization or outside it. Here, semantic technologies can help, since the biggest problem with data sets is often finding out what's in them. Semantic technologies make it possible to describe both the data itself and the connections between data sets in ways which are beyond the capabilities of any other technology. Such descriptions can make it vastly easier to navigate the data mess inside organizations.
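As a sketch of what such descriptions buy you, here is a toy catalogue of data set metadata, loosely modeled on the VoID vocabulary. The data set names, and the exact predicates, are invented for illustration; the point is that once descriptions are machine-readable triples, discovery becomes a simple query.

```python
# Toy data set catalogue: metadata about data sets, stored as triples
# in a style loosely inspired by VoID. Names are made up.

descriptions = [
    ("ex:customer-db", "dcterms:subject", "sales"),
    ("ex:customer-db", "void:triples", 1_200_000),
    ("ex:web-logs", "dcterms:subject", "traffic"),
    # A link declaration: customer-db contains links into web-logs.
    ("ex:customer-db", "void:linkTarget", "ex:web-logs"),
]

def datasets_about(topic):
    """Find data sets tagged with a given subject."""
    return [s for (s, p, o) in descriptions
            if p == "dcterms:subject" and o == topic]

def linked_to(dataset):
    """Find data sets that declare links into the given one."""
    return [s for (s, p, o) in descriptions
            if p == "void:linkTarget" and o == dataset]

print(datasets_about("sales"))   # ['ex:customer-db']
print(linked_to("ex:web-logs"))  # ['ex:customer-db']
```

The second query is the interesting one: because connections between data sets are first-class statements, "what else links into this data?" is answerable at all, which is precisely the navigation problem inside most organizations.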

Semantic technology also makes it much easier to integrate available information for analysis, but here there are multiple, competing approaches, and this is a subject worthy of a blog post of its own (not written yet). Suffice it to say that while semantic technologies have a lot to offer in this space, here they are in competition with a more pure Big Data approach.

I think the biggest weakness for semantic technologies in this regard is that most of the approaches in use so far assume clean data. What I mean is that reasoners, query engines, and so on operate without taking into account that data may be wrong, contradictory, or uncertain. Machine learning techniques handle this much better so far. I know there has been work on probabilistic and paraconsistent reasoning, and in my opinion this needs to be emphasized a lot more, because real-world data is really, really dirty.
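To illustrate what reasoning that acknowledges uncertainty might look like, here is a toy sketch in which every fact carries a confidence score, and a derived fact is only as strong as the combined confidence of the facts it rests on. The facts, predicate, and numbers are all made up; real probabilistic reasoners are far more sophisticated.

```python
# Toy uncertainty-aware inference: transitive closure over a predicate,
# where a derived fact's confidence is the product of its premises'.

facts = {
    ("ex:a", "ex:partOf", "ex:b"): 0.9,  # extracted, fairly reliable
    ("ex:b", "ex:partOf", "ex:c"): 0.6,  # noisy source
}

def infer_transitive(facts, predicate):
    """Derive (x, p, z) from (x, p, y) and (y, p, z), combining confidences."""
    derived = dict(facts)
    for (s1, p1, o1), c1 in facts.items():
        for (s2, p2, o2), c2 in facts.items():
            if p1 == p2 == predicate and o1 == s2:
                key = (s1, predicate, o2)
                conf = c1 * c2
                if conf > derived.get(key, 0.0):
                    derived[key] = conf
    return derived

result = infer_transitive(facts, "ex:partOf")
print(result[("ex:a", "ex:partOf", "ex:c")])  # roughly 0.54, weaker than either premise
```

Contrast this with a classical reasoner, which would assert the derived triple with the same certainty as any hand-curated fact, no matter how shaky the chain of evidence behind it.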

So my conclusion is that Big Data and Semantic Web technologies are partly in competition, partly complementary, and partly fairly far apart. What is clear, however, is that machine learning is deeply relevant to semantic technologies, and I predict that we'll be seeing much more of machine learning in this space.