This post contains the first set of results from my SPARQL extension survey. I’ve completed an initial survey of a number of different SPARQL processors to itemise the extension functions that each of them has implemented. This will be an ongoing activity as implementations evolve continually, but I thought it would be useful to summarise my findings so far.

If you want to look at the results for yourselves, then I’ve created a publicly accessible Google Spreadsheet that lists all of the results. The first tab of the spreadsheet includes the list of SPARQL endpoints/processors that I’ve surveyed.

I completed the initial round of the survey a few weeks ago, so any updates since then won’t have been included.

List of Implementations

The full list of surveyed processors/endpoints consists of:

Allegrograph

ARQ

Corese

Geospatialweb project

Mulgara

OpenAnzo

Openlink Virtuoso

Sesame

TopBraid product suite

XMLArmyKnife.com

If I’ve missed any other implementations that support extension functions then please let me know. I’m aware that other engines also support property functions, but I’ve not included this type of extension in my first survey round. I’ll be exploring that area in the new year.

I want to thank the implementers of a number of these systems for providing me with additional information, feedback and support as I’ve compiled the results. If anything has been misrepresented or simply missed, then you have my apologies and I will endeavour to fix any reported problems ASAP. The goal is to perform a fair, objective survey of the current situation: I’m not pushing any agenda here, other than a desire for convergence and continual improvement.

Breakdown of Results

The currently implemented extension functions can be organised into the following categories:

String

Date/Time

Math/Logic

RDF/Graph Manipulation

Geospatial

Network

The first three categories, covering string, date, and mathematical manipulations, have the largest number of functions. This is as expected, as these areas are the most useful for any programming or query language. Given that extension functions are restricted to value testing in SPARQL 1.0, you would also expect them to be most commonly used to provide additional flexibility when comparing strings, manipulating and comparing dates, and performing simple mathematical functions.

Very few implementations offer any functions in the remaining categories. I had originally expected to find more functions in the Geospatial category but I think that the majority of exploration in that area has focused on using property functions instead.

I would expect the number of distinct functions in each area to grow with the delivery of SPARQL 1.1, if it becomes possible to use them as part of a SELECT expression, e.g. to create new values/bindings, as well as just in FILTER tests. Those implementations that already offer a wide range of additional functions, such as Virtuoso, already have additional SPARQL language extensions that allow functions to be used in this way.

Currently, however, the numbers are inflated due to repeated implementation of the same function in different engines. For example, ARQ, Virtuoso and Corese all have their own variant of a “contains” function.

Portability

This brings me to the topic of query portability. A SPARQL query is portable if it can run unchanged on any SPARQL processor. A query is not portable if it uses proprietary extensions that are not supported on other processors. Implementers can increase portability by supporting each other’s extensions or by converging on a common set of functions. As a standard develops, you’d expect to see some replication of functions across engines before pressures from users, and a better understanding of the utility of various extensions, encourage convergence.
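The definition of portability above can be read as a simple set check. The following Python sketch assumes that we already know which extension function IRIs a query uses and which ones each processor supports. The engine names, the `http://example.org/functions#` namespace and the support table are all invented for illustration; they are not survey results.

```python
# Namespace of the XPath Functions and Operators library as used in SPARQL.
FN = "http://www.w3.org/2005/xpath-functions#"
# Hypothetical namespace standing in for a proprietary extension library.
EX = "http://example.org/functions#"

# Hypothetical support table: engine name -> supported function IRIs.
# These entries are made up; they do not describe any real processor.
SUPPORT = {
    "engine_a": {FN + "contains", FN + "upper-case", EX + "distance"},
    "engine_b": {FN + "contains", FN + "upper-case"},
    "engine_c": {FN + "contains"},
}


def engines_supporting(used_functions, support=SUPPORT):
    """Return the engines that implement every function the query uses."""
    used = set(used_functions)
    return sorted(name for name, funcs in support.items() if used <= funcs)


def is_portable(used_functions, support=SUPPORT):
    """A query is portable (within this table) if every engine can run it."""
    return len(engines_supporting(used_functions, support)) == len(support)
```

In this toy table a query that only calls `fn:contains` is portable, while one that calls the made-up `distance` function runs on a single engine.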

It’s encouraging to see that some replication of functions is happening across SPARQL engines. For example, both Mulgara and TopQuadrant support a basic set of string functions that were originally provided by the ARQ engine. These functions are part of the XPath Functions and Operators library, which acts as a handy “off-the-shelf” set of function definitions for SPARQL implementers to converge around. Mulgara also now supports a number of the EXSLT functions, which can act as another reference point for useful function definitions.

Looking at the list of extensions, it’s easy to see that more convergence could take place, as there are plenty of other extension functions that have been independently implemented. Expanding the set of commonly used functions in SPARQL is currently a time-permitting feature for SPARQL 1.1.

Replication of functions across implementations is partially hampered by a couple of non-standard ways in which extension functions have been implemented. For example, both Corese and Virtuoso implement their extension functions as language extensions, i.e. they don’t quite conform to the SPARQL 1.0 recommendation. Corese doesn’t associate its functions with a URI, i.e. they are just functions that are exposed in the basic language. The Virtuoso “bif” (built-in) functions are used with a prefix (e.g. bif:contains) but this prefix is not (and cannot be) associated with a URI. In both cases this means that other implementations cannot replicate the functions using existing extension points: they’d have to be implemented with similar language extensions, or query rewriting.
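One way to spot the prefix problem described above is to look for function calls whose prefix is never bound by a PREFIX declaration. The Python sketch below does this with two regular expressions; it is a rough heuristic rather than a SPARQL parser, and the sample query only illustrates the shape of the syntax (it has not been run against any engine). Unprefixed functions of the kind Corese exposes are not detected by this check.

```python
import re

# Matches "PREFIX name: <iri>" declarations and captures the prefix name.
PREFIX_DECL = re.compile(r"PREFIX\s+([A-Za-z][\w-]*):\s*<[^>]*>", re.IGNORECASE)
# Matches prefixed function calls such as "fn:contains(" and captures the prefix.
PREFIXED_CALL = re.compile(r"\b([A-Za-z][\w-]*):[A-Za-z][\w-]*\s*\(")


def undeclared_function_prefixes(query):
    """Return prefixes used in function calls but never declared."""
    declared = {m.group(1) for m in PREFIX_DECL.finditer(query)}
    used = {m.group(1) for m in PREFIXED_CALL.finditer(query)}
    return sorted(used - declared)


# Illustrative query: "fn" is bound to a namespace IRI, "bif" is not.
SAMPLE_QUERY = """
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?s WHERE {
  ?s ?p ?label .
  FILTER ( fn:contains(?label, "sparql") && bif:contains(?label, "rdf") )
}
"""

print(undeclared_function_prefixes(SAMPLE_QUERY))
```

A function reached through an unbound prefix has no IRI, so another engine has nothing to register an equivalent implementation against.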

Conclusions and Recommendations

I’m encouraged to see the wide range of experimentation that has been taking place around SPARQL extensions as it illustrates that developers are exploring how to use the language in a variety of ways. Extensions also indicate areas where the query language could be extended to encourage interoperability and address common issues faced by developers.

There is clearly a common set of functions around strings, dates and mathematical operators that ought to be available as a core part of the language. If the SPARQL 1.1 specification doesn’t end up defining this then I’d like to encourage the implementer community to do further work to explore replicating useful extensions or converging on a common set outside of the Working Group.

To help this process along it would be useful for developers to provide more feedback on the functions they find useful, and for some statistics to be gathered around which functions are being commonly used in practice.

Right now there is a common set of functions available from the ARQ engine that are implemented in at least two other SPARQL processors. The same functions can be ported to other engines with a minimum of query rewriting, often with little more than changes to query prefixes.
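When two engines expose the same function under different namespaces, the prefix change mentioned above can be automated. The Python sketch below rebinds an existing PREFIX declaration to a new namespace IRI and leaves the rest of the query alone. The target namespace `http://example.org/functions#` is a placeholder rather than a real engine’s function library, and the sketch assumes that the function’s local name and arguments are identical in both engines.

```python
import re


def rebind_prefix(query, prefix, new_iri):
    """Point an existing PREFIX declaration at a different namespace IRI."""
    pattern = re.compile(
        r"(PREFIX\s+" + re.escape(prefix) + r":\s*)<[^>]*>", re.IGNORECASE
    )
    # A replacement function avoids backslash-escape surprises in new_iri.
    return pattern.sub(lambda m: m.group(1) + "<" + new_iri + ">", query)


# Illustrative query using the XPath Functions and Operators namespace.
ORIGINAL_QUERY = """PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?s WHERE {
  ?s ?p ?o .
  FILTER ( fn:contains(?o, "sparql") )
}
"""

# Hypothetical target namespace used by some other engine.
PORTED_QUERY = rebind_prefix(ORIGINAL_QUERY, "fn", "http://example.org/functions#")
print(PORTED_QUERY)
```

Only the declaration line changes; the FILTER expression itself is untouched, which is what makes this kind of port cheap.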

My other recommendation at this stage is that implementers need to work harder on documenting the extensions they provide. Some engines have pretty good documentation, but for others the documentation is either hard to find or clearly lagging behind the latest code base. Publishing documentation about extensions, ideally with examples, really does help developers get started much more quickly.