Overview

This article is one of a series describing specific applications of the Real Semantics system. In particular, it describes how Real Semantics builds the Ontology2 Edition of DBpedia 2015-10. In this particular case, we import RDF data directly from DBpedia into a triple store (OpenLink Virtuoso) without transformation. However, the sheer bulk of the data, and the time involved, forces us to use sophisticated automation to produce a quality product. Working together with the AWS Marketplace, we can provide a matched set of code, data and hardware that can have people working with a large RDF data set in just minutes.

We'll start this article by describing one of the challenges of Linked Data: that is, how to publish and consume data when the costs of data processing, storage, and transfer start to become significant. We explain how cloud publishing lets us square that circle, coupling the costs of handling data to the time and place where people need it. We finish off the business case by considering another situation where cloud technology changes the economics of computing and can make formerly impossible things possible.

I (Paul Houle) started making cloud data products long before the development of Real Semantics, so I talk about the history of those efforts and how they contributed to the design decisions behind henson, the component of Real Semantics that constructs data-rich applications on cloud servers. Although Real Semantics works just fine on an ordinary computer, it is nice to be able to call upon cluster and cloud resources when necessary, and essential to be able to package code and data reliably for deployment to end users. Finally, we discuss the differences between the AWS platform targeted by Real Semantics and alternatives such as Microsoft's Azure and Hyper-V, as well as container-based systems.

Linked Data and its discontents

Big Data is a popular buzzword, but how many people are actually doing it? I got interested in the semantic web years ago, when I was making the site animalphotos.info; back then I was doing the obvious thing, making a list of animal species, then searching Flickr for pictures of the animals. I had a conversation with a Wikipedia admin, who turned me on to DBpedia. Between DBpedia and Amazon's Mechanical Turk I no longer needed to make a list or look at the photos, but instead I could import photographs with a structured and scalable process.

Since then, I have gone from exploiting general-purpose RDF data sources such as DBpedia with traditional tools to my current focus, which is using RDF tools to exploit traditional data sources. Still, at Ontology2 we use DBpedia and Freebase to organize and enrich traditional data sources.

People face a number of challenges using Linked Data sources, such as:

Handling the sheer bulk of the data

Understanding what data is there

Making effective queries against the data

Understanding and mitigating quality problems in the data

If you think these problems are bad for DBpedia, think of how hard it is to get a complete view of what's happening at a large corporation!

Understanding the data that is there is difficult with the "dereferencing" approach, where you go to a URL like:

http://dbpedia.org/resource/Linked_data

and then you get back a result that looks something like:

dbr:Linked_data
    a   ns6:Concept , yago:CloudStandards , wikidata:Q188451 , dbo:TopicalConcept ,
        yago:Abstraction100002137 , yago:Measure100033615 , dbo:Genre , owl:Thing ,
        yago:Standard107260623 , yago:SystemOfMeasurement113577171 ;
    rdfs:comment
        "Linked data is een digitale methode voor het publiceren ... de techniek van HTTP-URI's en RDF."@nl ,
        "O conceito ... explorar a Web de Dados."@pt ,
        "In computing, linked data ... can be read automatically by computers."@en ,
        "键连资料（又称:关联数据，英文: Linked data）... 但它们存在着关联。"@zh ,
        "Le Web des données (Linked Data, en anglais) ... l'information également entre machines. "@fr ,
        "In informatica i linked data ... e utilizzare dati provenienti da diverse sorgenti."@it ,
        "En informática ... que puede ser leída automáticamente por ordenadores."@es ,
        "Linked Data (связанные данные) ... распространять информацию в машиночитаемом виде."@ru ,
        "Linked Open Data ... では構造化されたデータ同士をリンクさせることでコンピュータが利用可能な「データのウェブ」の構築を目指している。"@ja ;
    rdfs:label
        "Dati collegati"@it , "鍵連資料"@zh , "Web des données"@fr , "بيانات موصولة"@ar ,
        "Linked data"@ru , "Linked data"@nl , "Linked Open Data"@ja , "Linked data"@en ,
        "Linked data"@pt , "Datos enlazados"@es ;
    dbo:wikiPageExternalLink
        <http://demo.openlinksw.com/Demo/customers/CustomerID/ALFKI%23this> , ns26:LinkedData ,
        <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3121711/> ,
        <http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf> ,
        <http://www.edwardcurry.org/publications/freitas_IC_12.pdf> ,
        <http://knoesis.wright.edu/library/publications/iswc10_paper218.pdf> ,
        <http://virtuoso.openlinksw.com/white-papers/> , <http://nomisma.org/> ,
        <http://www.semantic-web.at/LOD-TheEssentials.pdf> ,
        ns27:the-flap-of-a-butterfly-wing_b26808 , ns25:book , <http://linkeddata.org> ,
        <http://www.ahmetsoylu.com/wp-content/uploads/2013/10/soylu_ICAE2012.pdf> ,
        <http://www2008.org/papers/pdf/p1265-bizer.pdf> ,
        <http://www.community-of-knowledge.de/beitrag/the-hype-the-hope-and-the-lod2-soeren-auer-engaged-in-the-next-generation-lod/> ,
        <http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/> ,
        <http://knoesis.org/library/resource.php?id=1718> ,
        <http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data> ,
        <http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkingOpenData.pdf> ;
    dbo:wikiPageID 11174052 ;
    dbo:wikiPageRevisionID 677967394 ;
    dct:subject
        dbc:Semantic_Web , dbc:Internet_terminology , dbc:World_Wide_Web , dbc:Cloud_standards ,
        dbc:Data_management , dbc:Distributed_computing_architecture , dbc:Hypermedia ;
    owl:sameAs
        dbpedia-ja:Linked_Open_Data , dbpedia-ko:링크드_데이터 , dbpedia-el:Linked_Data ,
        dbpedia-es:Datos_enlazados , dbpedia-it:Dati_collegati , dbpedia-nl:Linked_data ,
        wikidata:Q515701 , dbr:Linked_data , dbpedia-pt:Linked_data ,
        dbpedia-fr:Web_des_données , dbpedia-wikidata:Q515701 ,
        <http://rdf.freebase.com/ns/m.02r2kb1> , dbpedia-eu:Datu_estekatuak ,
        yago-res:Linked_data ;
    prov:wasDerivedFrom <http://en.wikipedia.org/wiki/Linked_data?oldid=677967394> ;
    foaf:isPrimaryTopicOf wikipedia-en:Linked_data .

Now this is pretty neat (particularly in that there is a lot of multilingual information, which is invaluable if you are working on global projects such as LEIs), but you are looking at the data through a peephole. You have no idea what other records exist, what records link to this record, what predicates exist in the database, etc.
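With query access, which the next section turns to, each of those questions is a one-liner. Here is an illustrative sketch in SPARQL (the LIMIT values are arbitrary):

# what records link to this record?
select ?s ?p { ?s ?p dbr:Linked_data } LIMIT 100

# what predicates exist in the database, and how common is each?
select ?p (COUNT(*) as ?cnt) { ?s ?p ?o }
GROUP BY ?p ORDER BY DESC(?cnt) LIMIT 25

Note that the second query touches every triple in the store, which hints at the resource problem we will run into shortly.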

A popular response to that is the Public SPARQL Endpoint, which lets you write SPARQL queries against a data set. SPARQL is flexible, and you can write all kinds of exploratory queries. For instance, the following query finds topics that share a large number of predicate-object pairs with dbr:Diamond_Dogs, a David Bowie album:

select ?s (COUNT(*) as ?cnt) {
    dbr:Diamond_Dogs ?p ?o .
    ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt) LIMIT 10

and if you run this against the DBpedia Public SPARQL endpoint you get a very nice list of similar topics.

s                                                             cnt
http://dbpedia.org/resource/Diamond_Dogs                      158
http://dbpedia.org/resource/Aladdin_Sane                       50
http://dbpedia.org/resource/Station_to_Station                 47
http://dbpedia.org/resource/Young_Americans_(album)            44
http://dbpedia.org/resource/Low_(David_Bowie_album)            41
http://dbpedia.org/resource/Lodger_(album)                     41
http://dbpedia.org/resource/Never_Let_Me_Down                  39
http://dbpedia.org/resource/Let's_Dance_(David_Bowie_album)    38
http://dbpedia.org/resource/Hunky_Dory                         36
http://dbpedia.org/resource/Tonight_(David_Bowie_album)        36
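You can push this kind of exploration further. As a sketch (assuming English labels are what you want), the same similarity query can carry human-readable labels along with the matches:

select ?s ?label (COUNT(*) as ?cnt) {
    dbr:Diamond_Dogs ?p ?o .
    ?s ?p ?o .
    ?s rdfs:label ?label .
    FILTER(lang(?label) = "en")
} GROUP BY ?s ?label ORDER BY DESC(?cnt) LIMIT 10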

The Diamond Dogs query takes a few seconds to run, but it's easy to write a similar query that consumes far more resources, such as:

select ?s (COUNT(*) as ?cnt) {
    dbr:David_Bowie ?p ?o .
    ?s ?p ?o .
} GROUP BY ?s ORDER BY DESC(?cnt)

If you run that query on the public SPARQL endpoint (please don't), you'll get a much less nice result:

Virtuoso S1T00 Error SR171: Transaction timed out

SPARQL query:
select ?s (COUNT(*) as ?cnt) { dbr:David_Bowie ?p ?o . ?s ?p ?o . } GROUP BY ?s ORDER BY DESC(?cnt)

This is not just a problem with SPARQL; it's a problem that affects any API. If an API is simple and only allows you to do a limited number of things, the cost of running that API is predictable, so it can be offered for free or for sale at a specific price per API call. If an API lets you do arbitrarily complex queries, however, the cost of a query can vary by factors of a million or more, so resource limits must be applied.
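One common mitigation is to narrow the pattern yourself before the endpoint's resource limits cut you off. As a sketch, restricting the comparison to a single selective predicate (here dct:subject, i.e. shared Wikipedia categories) makes the David Bowie query vastly cheaper, because it no longer matches low-information triples such as rdf:type owl:Thing:

select ?s (COUNT(*) as ?cnt) {
    dbr:David_Bowie dct:subject ?o .
    ?s dct:subject ?o .
} GROUP BY ?s ORDER BY DESC(?cnt) LIMIT 10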

An alternative to the public SPARQL endpoint is the private SPARQL endpoint. Here you install a triple store on your own computer, load data, and then run your own queries. People who follow this route run into two problems:

it takes a lot of hardware. You need 16 to 32 GB of memory to comfortably work with DBpedia in a triple store. Memory upgrades aren't that expensive today, but most laptop computers have a limited number of memory slots, and many people don't want to tie up their computer for hours with a task that can slow it down.

it takes a lot of time and technical skill; for one thing, many triple stores lack an effective bulk loader. OpenLink Virtuoso has a good bulk loader (sketched below), but it takes effort to configure it for great performance and reliability. It can take several hours to load a large data set, and if mistakes mean you need to repeat the load several times, this can be a cumbersome and frustrating process. (Without automation, you might be tempted to live with a data set that is less than perfect to avoid the process of doing a reload to get it right.)
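For reference, a typical Virtuoso bulk load looks something like the following. This is a sketch run from Virtuoso's isql client; the directory path and file pattern are hypothetical, and the directory must appear in the DirsAllowed setting in virtuoso.ini:

-- register every matching file in a directory for loading
ld_dir('/data/dbpedia', '*.ttl.gz', 'http://dbpedia.org');

-- load all registered files; running several of these in parallel
-- isql sessions uses more cores
rdf_loader_run();

-- commit the loaded data to disk so it survives a restart
checkpoint;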

The AWS Marketplace lets us team up with Amazon Web Services to sell you a package of matching hardware, software and data. (See our product, the Ontology2 Edition of DBpedia 2015-10.) It is much easier to automate the build process in the cloud, because we always start with an identical cloud server with a fast connection to the net, whereas an installer would need to adapt to whatever state your desktop or server is in.