To address the lack of RDF import in SMW, the RDFIO suite was developed, comprising the RDFIO SMW extension and the standalone rdf2smw tool. The SMW extension consists of a set of functional modules, each implemented either as a MediaWiki Special page with a web form or as a command-line script. A description of the features and intended use of each of these parts follows. See also Fig. 1 for a graphical overview of how the different parts fit together.

RDF import web form

The RDF import web form allows the user to import RDF data in Turtle format, either from a publicly accessible URL on the internet or by manually entering or copying and pasting the data into a web form. This allows users to import small to moderate amounts of RDF data without the need for command-line access to the computer where the wiki is stored, as is often required for batch import operations. The drawback of this method is that, since the import operation is run as part of the web server process, it is not suited for large amounts of data. A large import would risk using up too many computational resources on the web server and making the website unresponsive for other users, particularly in the single-server setting that is often used in the biomedical domain.

SPARQL import web form

The SPARQL import web form allows importing all data from an external triple store exposed by a publicly accessible SPARQL endpoint. Given a URL pointing to an endpoint, it will in principle create a mirror of it, since the data imported into the wiki will in turn be exposed as a SPARQL endpoint (see the corresponding section below). The import is done with a query that matches all triples in the external triple store (in technical terms, a SPARQL clause of the form WHERE { ?s ?p ?o }). In order not to put too much load on the web server, the number of triples imported per execution is by default restricted to a pre-configured limit. This enables performing the import in multiple batches. The user can manually control the limit and offset values, but the offset value is also automatically increased after each import, so that the user can simply click the import button multiple times to import a number of batches with the selected limit of triples per batch.
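The batching behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the extension's actual PHP code. The function names are hypothetical, and the use of a CONSTRUCT query form is an assumption, since the text only specifies the WHERE clause.

```python
def build_batch_query(limit: int, offset: int) -> str:
    """Return a query matching all triples, restricted to one batch."""
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    return (
        "CONSTRUCT { ?s ?p ?o } "
        "WHERE { ?s ?p ?o } "
        f"LIMIT {limit} OFFSET {offset}"
    )


def next_offset(limit: int, offset: int) -> int:
    """Offset to pre-fill in the form once a batch has been imported."""
    return offset + limit


# Simulate three clicks on the import button, 25 triples per batch.
offset = 0
queries = []
for _ in range(3):
    queries.append(build_batch_query(25, offset))
    offset = next_offset(25, offset)
```

Note that SPARQL does not guarantee a stable result order without an ORDER BY clause, so whether consecutive batches are free of overlaps depends on the behaviour of the remote triple store.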

SPARQL endpoint

The SPARQL endpoint (see Fig. 2) exposes all the semantic data in the wiki through a web form where the data can be queried using the SPARQL query language. The endpoint also allows external services to query it via HTTP GET or POST requests. It can output either a formatted HTML table for quick previews and debugging of queries, a machine-readable XML result set, or full RDF triples in RDF/XML format. The RDF/XML format requires the use of the CONSTRUCT keyword in the SPARQL query to define the RDF structure to use for the output. Using CONSTRUCT to output RDF/XML essentially amounts to a web-based RDF export feature, which is why a separate RDF export web form was not deemed necessary.
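As a rough illustration of how an external service might address the endpoint over HTTP GET, the sketch below builds a request URL for a CONSTRUCT query. The endpoint URL is hypothetical, and the parameter name query is an assumption taken from the standard SPARQL protocol; the parameters actually accepted by RDFIO (for example for selecting the output format) are not shown and should be checked against the extension's documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical wiki location; replace with the address of a real wiki.
ENDPOINT = "https://wiki.example.org/index.php/Special:SPARQLEndpoint"

# CONSTRUCT defines the shape of the RDF that is returned.
CONSTRUCT_QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"


def build_get_url(endpoint: str, query: str) -> str:
    """Encode the query as a URL parameter (assumed name: 'query')."""
    return endpoint + "?" + urlencode({"query": query})


url = build_get_url(ENDPOINT, CONSTRUCT_QUERY)
# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```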

Fig. 2 A screenshot of the SPARQL endpoint web form in RDFIO. A key feature of the SPARQL endpoint is the ability to output the original RDF resource URIs of wiki pages, as used in the originally imported data. This can be seen in the checkbox options named “Query by Equivalent URIs” and “Output Equivalent URIs”, named so because the original URIs are stored using the “Equivalent URI” special property on each page created in the import.

The SPARQL endpoint also allows adding new data to the wiki using the INSERT INTO statement available in the SPARQL+ extension supported by ARC2.

RDF import batch script

The RDF import batch script (importRdf.php) is executed on the command line and allows robust import of large data sets. Because it is executed using the standalone PHP or HHVM (PHP virtual machine) [36, 37] executable rather than the web server process, it does not interfere with the web server process as much as the web form based import. It also does not run into the various execution time limits that are configured for the PHP process or the web server. While a batch import could also be implemented in the web form by using a page reload feature or an AJAX-based JavaScript solution, this is a more complex solution that has not yet been addressed due to time constraints. An example of executing the batch RDF import script in the terminal is shown in Fig. 3.

Fig. 3 Usage of the command-line import tool in RDFIO. The figure shows examples of shell commands to use to import an RDF dataset, in this case in N-Triples format, saved in a file named dataset.nt. The steps are: i) Change directory into the RDFIO/maintenance folder, and then ii) execute the importRdf.php script. One can set the option --chunksize to determine how many triples will be imported at a time, and --offset to determine how many triples to skip at the beginning of the file, which can be useful when restarting an interrupted import session. The $WIKIDIR variable represents the MediaWiki base folder.
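The meaning of the --chunksize and --offset options can be illustrated with a small Python sketch. It is not the importRdf.php implementation; the function name iter_chunks is hypothetical, and the sketch relies only on the fact that N-Triples stores one triple per line.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunksize: int,
                offset: int = 0) -> Iterator[List[str]]:
    """Skip `offset` triples, then yield lists of at most `chunksize`."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    # N-Triples permits blank lines and comment lines; ignore both.
    triples = (ln for ln in lines
               if ln.strip() and not ln.lstrip().startswith("#"))
    remaining = islice(triples, offset, None)
    while True:
        chunk = list(islice(remaining, chunksize))
        if not chunk:
            return
        yield chunk


# Five example triples, as they might appear in dataset.nt.
sample = [f'<http://example.org/s{i}> <http://example.org/p> "v{i}" .'
          for i in range(5)]

# Resume after the first triple, importing two triples at a time.
chunks = list(iter_chunks(sample, chunksize=2, offset=1))
```

Restarting an interrupted session then amounts to passing the number of triples already imported as the offset.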

Stand-alone RDF-to-MediaWiki-XML conversion tool (rdf2smw)

The rdf2smw tool uses the same strategy for converting RDF data to a wiki page structure as the RDFIO extension, but differs in the following way: whereas the RDFIO extension converts RDF to wiki pages and writes these pages to the wiki database in one go, the standalone tool first converts the full RDF dataset to a wiki page structure and writes it to an XML file in MediaWiki’s XML import format, as illustrated in Fig. 1. This format is very straightforward, storing the wiki page data as plain text, which makes it possible to manually inspect the file before importing it.
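To give an impression of the XML file, the sketch below wraps wiki page text in a heavily simplified version of MediaWiki's XML format using Python's standard library. It is an illustration only: rdf2smw itself is a separate program, the function name pages_to_xml is hypothetical, and real MediaWiki dumps carry additional elements and attributes (such as a schema version and namespace declarations) that are omitted here.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def pages_to_xml(pages: Dict[str, str], timestamp: str) -> str:
    """Wrap page titles and wiki text in a simplified MediaWiki XML tree."""
    root = ET.Element("mediawiki")
    for title, wikitext in pages.items():
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "timestamp").text = timestamp
        # The page content is stored as plain text, so it stays readable.
        ET.SubElement(revision, "text").text = wikitext
    return ET.tostring(root, encoding="unicode")


xml_text = pages_to_xml({"Gene1": "[[symbol::BRCA1]]"},
                        "2017-01-01T00:00:00Z")
parsed = ET.fromstring(xml_text)
```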

The rdf2smw tool is implemented in Go, and programs written in Go are generally orders of magnitude faster than similar programs written in PHP. This performance difference, together with the fact that the standalone rdf2smw tool executes separately from the web server running the wiki, is crucial when importing large data sets (consisting of more than a few hundred triples), since the import requires demanding in-memory data operations such as sorting and aggregation of triples per subject. This is the main reason why this external tool was developed.

The usage of the tool together with MediaWiki’s built-in XML import script is illustrated in Fig. 4.

Fig. 4 Command-line usage of the rdf2smw tool. The figure shows the intended usage of the rdf2smw command line tool. The steps are, one per line in the code example: i) Execute the rdf2smw tool to convert the RDF data into a MediaWiki XML file. ii) Change directory into the MediaWiki maintenance folder. iii) Execute the importDump.php script, with the newly created MediaWiki XML file as first argument. The $WIKIDIR variable represents the MediaWiki base folder.

RDF export batch script

The RDF export batch script (exportRdf.php) is a complement to the RDF export functionality available in the SPARQL endpoint. Analogously to the import batch script, it allows robust export of large data sets without the risk of time-outs and other interruptions that might affect the web server process or the user’s web browser.

An example of executing the batch RDF export script in the terminal is shown in Fig. 5.

Fig. 5 Usage of the command-line export tool in RDFIO. The figure shows examples of shell commands to use to export an RDF dataset, in this case in N-Triples format, into a file named dataset.nt. The steps are: i) Change directory into the RDFIO/maintenance folder, and then ii) execute the exportRdf.php script, selecting the export format using the --format parameter. The --origuris flag tells RDFIO to convert SMW’s internal URI format back to the URIs used when originally importing the data, using the linking information added via SMW’s “Equivalent URI” property.
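The effect of the --origuris flag can be thought of as a lookup from internal wiki URIs to the original URIs recorded via the “Equivalent URI” property. The Python sketch below illustrates this idea only; the function name, the internal URI prefix and the example URIs are all hypothetical and do not reflect how exportRdf.php is implemented.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical prefix standing in for SMW's internal URI format.
INTERNAL = "http://wiki.example.org/Special:URIResolver/"


def restore_original_uris(triples: List[Triple],
                          equivalent_uris: Dict[str, str]) -> List[Triple]:
    """Replace internal URIs with the original URIs where one is known."""
    def restore(term: str) -> str:
        # Terms without a recorded Equivalent URI are left untouched.
        return equivalent_uris.get(term, term)

    return [(restore(s), restore(p), restore(o)) for s, p, o in triples]


equivalent_uris = {
    INTERNAL + "Gene1": "http://example.org/gene/1",
    INTERNAL + "Has_symbol": "http://example.org/symbol",
}
internal_triples = [(INTERNAL + "Gene1", INTERNAL + "Has_symbol", "BRCA1")]
exported = restore_original_uris(internal_triples, equivalent_uris)
```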

An overview of the RDF import process

As can be seen in Fig. 1, all of the import functions run through the same RDF-to-wiki conversion code except for the rdf2smw tool which has a separate implementation of roughly the same logic in the Go programming language.

The process is illustrated in some detail in Fig. 6 and can briefly be described with the following processing steps:

Fig. 6 A simplified overview of the RDF to wiki page conversion process. The figure shows, in a somewhat simplified manner, the process used to convert RDF data to a wiki page structure. Code components are drawn as grey boxes with cog wheels in the top right corner, while data are drawn as icons without a surrounding box. From top to bottom, the figure shows how RDF triples are first aggregated per subject and then converted into one wiki page per subject, with all URIs converted to wiki titles for new pages and links to pages, after which the pages are either written directly to the wiki database (the RDFIO SMW extension) or converted to XML and written to files (the standalone rdf2smw tool).

All triples in the imported chunk (the number of triples per chunk can be configured for the command-line import script, while the web form imports a single chunk) are aggregated per subject resource. This is done since each subject resource will be turned into a wiki page, where predicate-object pairs will be added as SMW fact statements, each consisting of a corresponding property-value pair.

WikiPage objects are created for each subject resource. The title for this page is determined from the Uniform Resource Identifier (URI) of the subject, or from some of the predicates linked to this subject, according to a scheme described in more detail below.

All triples with the same subject, which have now been aggregated together, are turned into SMW facts (property-value pairs) to be added to the wiki page. Predicate and object URIs are converted into wiki page titles in the process, so that the corresponding property and value will point to valid wiki page names. Naturally, if the object is a literal rather than a URI, no transformation is applied to it. During this process the pages corresponding to the created property titles are also annotated with SMW data type information, based on XML Schema type information in the RDF source data.

Optionally, the facts can be converted into a MediaWiki template call, if there is a template available that will write the corresponding fact, by the use of its parameter values.

In the rdf2smw tool only, the wiki page content is then wrapped in MediaWiki XML containing meta data about the page, such as title and creation date.

In the RDFIO SMW extension only, the wiki page objects are now written to the MediaWiki database.
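The core of the steps above, aggregating triples per subject and turning predicate-object pairs into SMW facts, can be sketched in Python as follows. This is a simplified illustration, not the RDFIO or rdf2smw code: all function names are hypothetical, URIs are recognised by a naive prefix check, and titles are derived only from the local part of the URI, whereas the actual tools use the more elaborate title scheme described in the next section. Data type annotation and template calls are left out.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def is_uri(term: str) -> bool:
    # Simplification: anything that looks like an HTTP(S) URI is a resource.
    return term.startswith(("http://", "https://"))


def local_part(uri: str) -> str:
    # Placeholder title scheme: the part after the last "/" or "#".
    return re.split(r"[/#]", uri)[-1]


def aggregate_by_subject(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group predicate-object pairs under their subject resource."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return dict(grouped)


def facts_to_wikitext(facts: List[Tuple[str, str]]) -> str:
    """Render each pair as an SMW fact of the form [[property::value]]."""
    lines = []
    for p, o in facts:
        value = local_part(o) if is_uri(o) else o  # literals stay as-is
        lines.append(f"[[{local_part(p)}::{value}]]")
    return "\n".join(lines)


def triples_to_pages(triples: List[Triple]) -> Dict[str, str]:
    """Create one wiki page (title -> wiki text) per subject."""
    return {local_part(s): facts_to_wikitext(facts)
            for s, facts in aggregate_by_subject(triples).items()}


EX = "http://example.org/"
pages = triples_to_pages([
    (EX + "Gene1", EX + "symbol", "BRCA1"),
    (EX + "Disorder7", EX + "name", "Disorder 7"),
    (EX + "Gene1", EX + "associatedWith", EX + "Disorder7"),
])
```

In this example the two triples about Gene1 end up on a single page, with the object URI of the second fact converted into a link to the Disorder7 page.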

Converting URIs to user friendly wiki page titles

The primary challenge in the described process is to determine user friendly wiki titles for the resources represented by URIs in the RDF data. This is done by trying out a defined set of strategies, stopping as soon as a title has been determined. The first step is to check whether there is already a page connected to the URI via an Equivalent URI fact in the wiki text. If this is the case, the existing title (and page) is used for the triple. Otherwise, the following strategies are tried in the stated order: 1) If there are any properties commonly used to provide a title or label for a resource, such as dc:title from the Dublin Core ontology [38], the value of that property is used. 2) If a title is still not found, the base part, or “namespace”, of the URI is shortened according to an abbreviation scheme provided in the RDF dataset in the form of namespace abbreviations. 3) Finally, if none of the above strategies could provide an acceptable title, the “local part” of the URI (the part after the last / or # character in the URI) is used.
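The cascade of strategies can be sketched in Python as shown below. This is an illustration under several assumptions rather than the actual implementation: the function name choose_title is hypothetical, rdfs:label is included as a second label property in addition to the dc:title mentioned in the text, and the exact form of an abbreviated title (here prefix:local) is a guess.

```python
import re
from typing import Dict, List, Tuple

# Properties assumed to carry human-readable titles or labels.
TITLE_PROPERTIES = (
    "http://purl.org/dc/elements/1.1/title",       # dc:title
    "http://www.w3.org/2000/01/rdf-schema#label",  # rdfs:label (assumption)
)


def choose_title(uri: str,
                 facts: List[Tuple[str, str]],
                 existing_pages: Dict[str, str],
                 prefixes: Dict[str, str]) -> str:
    """Try each strategy in order and stop at the first one that succeeds."""
    # Existing page already linked to this URI via an Equivalent URI fact.
    if uri in existing_pages:
        return existing_pages[uri]
    # 1) Value of a title or label property of the resource.
    for predicate, value in facts:
        if predicate in TITLE_PROPERTIES:
            return value
    # 2) Abbreviate the namespace, preferring the longest matching one.
    for prefix, namespace in sorted(prefixes.items(),
                                    key=lambda item: -len(item[1])):
        if uri.startswith(namespace) and len(uri) > len(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    # 3) Fall back to the part after the last "/" or "#".
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


DC_TITLE = "http://purl.org/dc/elements/1.1/title"
t_existing = choose_title("http://example.org/a", [],
                          {"http://example.org/a": "Existing page"}, {})
t_label = choose_title("http://example.org/a", [(DC_TITLE, "Aspirin")], {}, {})
t_prefix = choose_title("http://example.org/ns#a", [], {},
                        {"ex": "http://example.org/ns#"})
t_local = choose_title("http://example.org/path/a", [], {}, {})
```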

Performance

Table 1 reports the time needed to import a given number of triples (100, 1000, 10000 or 100000), drawn as subsets from a test dataset (the Comparative Toxicogenomics Database [39], converted to RDF by the Bio2RDF project), in two ways: using the RDFIO SMW extension directly via the importRdf.php command-line script, and converting the data to MediaWiki XML files with the rdf2smw tool and then importing them using MediaWiki’s importDump.php script. Note that when using the rdf2smw tool, the import is thus performed in two phases.

Table 1 Execution times for importing RDF data into SMW for a few different dataset sizes (column 1), using the importRdf.php script in the RDFIO extension (column 2), and converting to MediaWiki XML files using the rdf2smw tool and then importing the generated XML files with MediaWiki’s built-in XML import tool (columns 3 and 4, respectively)

The tests were performed in a VirtualBox virtual machine running Ubuntu 15.10 64-bit, on a laptop running Ubuntu 16.04 64-bit. The laptop used was a 2013 Lenovo Thinkpad Yoga 12 with a 2-core Intel i5-4210U CPU, with base and max clock frequencies of 1.7 GHz and 2.7 GHz respectively, and with 8 GB of RAM. The PHP version used was PHP 5.6.11. Time is given in seconds and, where applicable, also in minutes and seconds, or hours, minutes and seconds.

Manual testing by the authors shows that the performance of an SMW wiki is not noticeably affected by multiple users reading or browsing the wiki. An import of many triples can, however, temporarily slow down browsing for other users because of table locking in the database. This is a characteristic common to MediaWiki wikis when a large import operation is in progress or when multiple article updates are done at the same time, unless special measures are taken, such as having separate, replicated database instances for reading to alleviate the load on the primary database instance.

Continuous integration and testing

The fact that RDFIO is an extension to a larger piece of software (SMW), which itself is an extension of MediaWiki, and that much of their functionality depends on state in a relational database, has added complexity to the testing process. Recently though, continuous integration systems as well as improved test tooling for MediaWiki and SMW have enabled better automated testing also for RDFIO. We use CircleCI as the continuous integration system, and results from this and other services are shown as indicator buttons in the README files of the respective GitHub repositories.

As part of the build process, system tests are run for the RDF import function and for the RDF export function, verifying that the exported content matches the data that was imported. In addition, work has been started to add unit tests. User experience testing has been carried out in real-world projects mentioned in the introduction, where some of the authors were involved [16, 17].

Round-tripping

As mentioned above, a system test for the round-tripping of data via the RDF import and export functions is run, to ensure that no data is corrupted in the process. It is worth noting though that the RDF export will generally output more information than what was imported. This is because SMW stores certain metadata about all pages created, such as the modification date. In the system test, these data are filtered out so that the test checks only the consistency of the triples that were imported using RDFIO. An example of the difference between the imported and exported data can be seen in Fig. 7.
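The idea behind such a round-trip check can be sketched in Python as a set comparison, in which every imported triple must reappear in the export while any additional exported triples are treated as SMW metadata. This is a conceptual sketch, not the system test used in RDFIO; the function name and the metadata predicate are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def compare_round_trip(imported: Iterable[Triple],
                       exported: Iterable[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (missing, extra) triples after an import/export round-trip."""
    imported_set, exported_set = set(imported), set(exported)
    missing = imported_set - exported_set  # must be empty for a clean trip
    extra = exported_set - imported_set    # e.g. metadata added by the wiki
    return missing, extra


EX = "http://example.org/"
imported = [(EX + "Gene1", EX + "symbol", "BRCA1")]
exported = [
    (EX + "Gene1", EX + "symbol", "BRCA1"),
    # Hypothetical metadata triple of the kind a wiki might add on export.
    (EX + "Gene1", EX + "modificationDate", "2017-01-01T00:00:00Z"),
]
missing, extra = compare_round_trip(imported, exported)
round_trip_ok = not missing
```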

Fig. 7 A comparison between data before and after an import/export round-trip. This figure shows to the left a dataset containing a single triple in Turtle format. To the right is shown the data resulting from an import/export round-trip, that is, importing the initial data into a virtually blank wiki (the wiki front page “Main Page” being the only page in the wiki) and then running an export again. It can be seen in the exported data how i) the “Main Page” adds a certain amount of extra data, and ii) a substantial amount of extra metadata about each resource is added by SMW. The subject, predicate and value of the initial triple are colour-coded with the same colours in both code examples (before and after) to make them easier to find.

Known limitations

At the time of writing this, we are aware of the following limitations in the RDFIO suite of tools:

The rdf2smw tool supports only N-Triples format as input.

There is currently no support for importing triples into separate named graphs, such that e.g. imported and manually added facts could be separated and exported separately.

There is no functionality to detect triples for removal when updating the wiki with a new version of a previously imported dataset in which some triples have been deprecated or removed.

Cases with thousands of triples for a single subject leading to thousands of fact statements on a single wiki page – while technically possible – could lead to cumbersome manual editing.

These limitations are planned to be addressed in future versions of the tool suite.

Demonstrators

Demonstrator I: Orphanet - rare diseases linked to genes

An important usage scenario for RDFIO is to visualise and enable easy navigation of RDF data by bootstrapping an SMW instance from an existing data source. To demonstrate this, the open part of the Orphanet dataset [40] was imported into SMW. Orphanet consists of data on rare disorders, including associated genes. The dataset was already available in RDF format through the Bio2RDF project [12], from where the dataset was accessed and imported into SMW. This dataset consisted of 29059 triples and was first converted to MediaWiki XML using the standalone rdf2smw tool, which was then imported using MediaWiki’s built-in XML import script. This presented an easy to use platform for navigating the Orphanet data, including creating listings of genes and disorders. Some of these listings are created automatically by SMW but additional listings can also be created on any page in the wiki, including on the wiki pages representing RDF resources, by using the template feature in MediaWiki in combination with the inline query language in SMW [41].

One useful user-created listing on an RDF node was a listing of all the disorder-gene associations linking to a particular gene, together with the corresponding disorders, added to the template for the gene pages (for an example, see Fig. 8). In the same way, a listing of the disorder-gene associations linking to a particular disorder, together with the corresponding genes, was added to the template for the disorder pages.

Fig. 8 Screenshot of a wiki page for a gene in the Orphanet dataset. In the middle of the page, the listing of gene disorder associations and the corresponding disorders is shown. Note that these details are not entered on this page itself, but are queried using SMW’s inline query language and dynamically displayed. To the right are details entered directly on the page.

This example shows how it is possible, on a wiki page representing an RDF resource, to list not only information directly linked to this particular resource, but also information connected via intermediate linking nodes. Concretely, in the example shown in Fig. 8 we list a resource type (diseases) on a page representing a gene even though in the RDF data diseases are not directly linked to genes. Instead they are linked via an intermediate “gene-disorder association” node.

Demonstrator II: DrugMet - cheminformatics/metabolomics

The DrugMet dataset is an effort at collecting experimental pKa values extracted from the literature, each linked to the publication from which it was extracted and to the chemical compound for which it was measured. The DrugMet dataset was initially created by manually adding the details in a self-hosted Semantic MediaWiki. The data was later transferred to the Wikidata platform [21] for future-proofing and to enable access to the data for the wider community.

This demonstrator highlights how the data could be extracted again from Wikidata into a locally hosted SMW for further local curation.

The data was exported from Wikidata using its publicly available SPARQL REST interface [42]. The extraction was done using a CONSTRUCT query in SPARQL, which made it possible to create a custom RDF format specifically designed for the demonstrator. For example, in addition to the publication and compound data, the query was modified to include rdf:type information for all the compounds, which is used by the RDFIO command line tool to generate a MediaWiki template call, and a corresponding template, for all items of this type.

After the data was imported into a local SMW wiki, it was possible to create a page with an SMW inline query displaying a dynamically sorted list of all the compounds, their respective pKa values, and links to the publications from which the pKa values were originally extracted. The query for this extraction is shown in Fig. 9, and the list is shown in Fig. 10.

Fig. 9 The SPARQL query for extracting DrugMet data. This screenshot shows the SPARQL query for extracting DrugMet data in Wikidata’s SPARQL endpoint web form. This query can be accessed in the Wikidata SPARQL endpoint via the URL: goo.gl/C4k4gx

Fig. 10 A dynamic listing of DrugMet data. The listing shows a locally hosted SMW wiki with a list of compounds and related information. The list is a custom, dynamically generated listing of compound name, pKa value and a link to the publication from which each pKa value was extracted, created using SMW’s inline query language.

Implications of the developed functionality

The demonstrators above show that the RDFIO suite of tools is successfully bridging the worlds of the easy-to-use wiki systems and the somewhat more technically demanding wider Semantic Web. This bridging has opened up a number of useful scenarios for working with semantic data in a flexible way, where existing data in semantic formats can easily and flexibly be combined by using the templating and querying features in SMW. This leads to a powerful experimentation platform for exploring and summarising biomedical data, which earlier was not readily accessible.

Availability

Complete information about the RDFIO project can be found at pharmb.io/project/rdfio

A canonical location for information about the RDFIO SMW extension is available at MediaWiki.org at www.mediawiki.org/wiki/Extension:RDFIO

All the software in the RDFIO suite is available for download on GitHub, under the RDFIO GitHub organisation at github.com/rdfio, where the RDFIO SMW extension is available at github.com/rdfio/rdfio, the rdf2smw tool at github.com/rdfio/rdf2smw, and an automated setup of a virtual machine with a fully configured SMW wiki with RDFIO installed at github.com/rdfio/rdfio-vagrantbox.

Outlook

Planned future developments include enhancing the rdf2smw tool with support for more RDF formats as input.

Further envisioned development areas are:

iv) Separating the ARC2 data store and SPARQL endpoint into a separate extension, so that the core RDFIO SMW extension does not depend on them. This could potentially improve the performance of data import and querying, as well as make the core RDFIO extension easier to integrate with external triple stores via SMW’s triple store connector. v) Exposing the RDF import functionality as a module via MediaWiki’s action API [43]. This would allow external tools to talk to SMW via an established web interface. vi) Allowing domain-specific queries to be stored and tied to certain properties, so that they can, on demand, pull in related data for entities of a certain ontology, such as gene information from Wikidata for genes.