XLWrap – Spreadsheet-to-RDF Wrapper

News

Sources available on github: https://github.com/theandyl/xlwrap

Some new features have been discussed at the Linked Data Camp 2009 in Vienna: http://linkeddatacamp.org/wiki/XLWrap_state_and_feature_requests

Please subscribe to the XLWrap Users mailing list!

Contents

XLWrap is a spreadsheet-to-RDF wrapper which is capable of transforming spreadsheets to arbitrary RDF graphs based on a mapping specification. It supports Microsoft Excel and OpenDocument spreadsheets such as comma- (and tab-) separated value (CSV) files and it can load local files or download remote files via HTTP.

Application Areas:

Integration of spreadsheets from distributed locations based on ontologies (together with SemWIQ, it is possible to integrate spreadsheets with other data sources such as relational databases)

based on ontologies (together with SemWIQ, it is possible to integrate spreadsheets with other data sources such as relational databases) Publication of data from spreadsheets on the Web of Data as RDF (and via SPARQL)

on the Web of Data as RDF (and via SPARQL) Quick setup of an editable data source for the rapid prototyping of Semantic Web and Linked Data applications

Consider, we would like to make data from this spreadsheet available via a SPARQL endpoint. The spreadsheet contains revenues from a company represented in a cross table by country, year, and products. We want to provide the generated RDF data according to the SCOVO vocabulary.

These are the only three steps we need to do:

Download and extract XLWrap (includes XLWrap-Server)

Create a mapping (explained below) with your favorite editor and place it into the folder

[xlwrap-folder]/mappings

Start XLWrap-Server:

$ bin/server (with -p for a port other than :8900)

That's it!

You can use Snorql (known from D2R-Server) to browse the wrapped dataset or use the SPARQL endpoint at http://localhost:8900/sparql.

Here is the complete target graph for the given example (alternatively in RDF/XML syntax).

Additional examples are provided as part of the Mapping Design Patterns collection.

A mapping is provided as part of an RDF file in TriG syntax. TriG is very similar to N3 and allows the representation of multiple (named) RDF graphs in a single file. An XLWrap mapping file must contain exactly one graph that contains exactly one RDF instance of xl:Mapping (typically as part of the default graph). Other graphs in the same file will be used to denote template graphs.

Example:

{ # default graph in TriG file [] a xl:Mapping ; xl:template [ xl:fileName "file:example.xls" ; xl:sheetName "foo" ; xl:templateGraph :Foo ; xl:transform [ ... ] ] . } :Foo { rdfs:label "A2 & ' ' & B2"^^xl:Expr ; ... }

XLWrap mappings are based on template graphs, which may contain XLWrap expressions including cell references similar to a typical spreadsheet application such as Microsoft Excel or OpenOffice Calc. Template graphs are repeatedly applied on a work sheet (or other work sheets) in order to produce the target graph. Depending on the representation of the information stored in the spreadsheet, which may be flat tables or cross tables over multiple sheets and files, each template graph is moved across sheets and subsequently applied for different combinations of cells. How template graphs are moved is specified by transform operations.

The XLWrap mapping vocabulary can be found here: http://purl.org/NET/xlwrap#.

An xl:Mapping consists of one or more map templates, which are denoted by the property xl:template. Example:

{ # default graph (mapping specification) [] a xl:Mapping ; xl:template [ ... ] ; xl:template [ ... ] ; ... }

Each map template (value of xl:template ) consists of the following components:

base workbook (required, one): xl:fileName, a string – a local or remote spreadsheet file

File names are interpreted as URIs. HTTP URIs are recognized and local file names are automatically prefixed with file:

Examples:

xl:fileName "http://example.org/path/foo.xls" xl:fileName "file:foo.ods" xl:fileName "foo.ods" xl:fileName "../path/file.csv"

base worksheet (optional, one): xl:sheetName, a string or xl:sheetNumber, an integer – default is the first sheet of the base workbook

In case of CSV files this property should be omitted or the sheet #1 must be specified. Examples:

xl:sheetName "Sheet 1" xl:sheetNumber "4"^^xsd:int xl:sheetNumber "4" # also allowed as plain literal

constant graph (optional, multiple allowed): xl:constantGraph, a graph name – the referred graph must exist in the TriG file

The constant graph is optional and processed only once. It may contain XLWrap Expressions.

Example:

{ [] a xl:Mapping ; ... xl:constantGraph :ANamedGraph ... } :ANamedGraph { ex:dataset dc:creator "Michael Jackson" ; dc:date "NOW()"^^xl:Expr ; rdfs:seeAlso "FILENAME(A1)"^^xl:Expr ; }

template graph (required, one): xl:templateGraph, a graph name – the referred graph must exist in the TriG file

The template graph is evaluated according to the transform operations. It may contain XLWrap Expressions.

Example:

{ [] a xl:Mapping ; ... xl:templateGraph :ANamedGraph ... } :ANamedGraph { [ xl:uri "'http://example.org/' & URLENCODE(A2 & B2)"^^xl:Expr ] a foaf:Person ; foaf:name "A2 & ' ' & B2"^^xl:Expr ; foaf:mbox_sha1sum "SHA(C2)"^^xl:Expr ; }

sequence of transform operations (optional, one sequence): the RDF sequence is specified with the xl:transform property, each element must be an instance of the (abstract) RDF class xl:Transformation. These are the instructions for shifting and repeating the template graph multiple times for different spreadsheet cells. If omitted, the original template graph is applied just once.

Example:

xl:transform [ a rdf:Seq ; rdf:_1 [ a xl:RowShift ; xl:breakCondition "ALLEMPTY(*.*)" ; ] ; rdf:_2 [ a xl:SheetShift ; xl:repeat "3" ; ]

The following transform operations (sub-classes of xl:Transformation) are currently available:

row shift (xl:RowShift)

Example:

[ a xl:RowShift ; xl:steps "2" ; xl:repeat "2000" ; xl:restriction "Sheet1.*" ; xl:breakCondition "ALLEMPTY(*.*)" ; ]

column shift (xl:ColShift)

Example:

[ a xl:ColShift ; xl:steps "5" ; xl:repeat "1000" ; xl:restriction "A3:N20;A1;'Sheet 2'.*" ; xl:breakCondition "A1 == 'foo'" ; ]

sheet shift (xl:SheetShift)

Example:

[ a xl:SheetShift ; xl:steps "1" ; xl:repeat "10" ; xl:restriction "A1;A4:A6;B9" ; xl:breakCondition "INT(LEFT(SHEETNAME(A1), 4)) <= 2007" ; ]

sheet repeat (xl:SheetRepeat)

Example:

[ a xl:SheetRepeat ; xl:restriction "*.*" ; xl:breakCondition "A1 == 'foo'" ; xl:sheetNames "Sheet 1, Sheet 2, 'Sheet,, 3', Sheet 4" ; ]

file repeat (xl:FileRepeat)

Example:

[ a xl:FileRepeat ; xl:restriction "*.*" ; xl:breakCondition "A1 <= 10" ; xl:fileNames "http://example.com/foo.xls, file:localfile.csv" ; ]

For each of these operations, an optional range restriction (property xl:restriction) can be specified, which restricts the operation to a multi-range, i.e. only range references within the restriction are transformed (default is any range: *.* ).

Furthermore, an optional break condition can be specified as an XLWrap Expression with the xl:breakCondition property (default condition expression: false, i.e. "never break until end of file or specified times repeated"). The condition is evaluated after a template graph is transformed and if it evaluates to true, the transform operation is skipped and XLWrap continues with the next stage of the following transform operation.

For shift operations it is possible to specify the steps (in terms of columns/rows/sheets to shift) with the property xl:steps (default is 1 ) and the number of times the operation should be repeated with the property xl:repeat (default is the maximum integer value of the runtime system).

In case of the repeat operations it is required to specify a list of sheet names with the property xl:sheetNames or file names, respectively, with xl:fileNames. They are specified as a comma-separated list in a plain literal.

XLWrap expressions are used:

within template graphs

within constant graphs

for conditions of transform operations (in this case the return value must be boolean, e.g. A4 == 'foo' )

The basic elements of XLWrap expressions are:

numbers

strings (unquoted if they do not contain special characters, single or double quoted otherwise)

(unquoted if they do not contain special characters, single or double quoted otherwise) the boolean constants true and false

constants true and false range references (e.g. A3 , Sheet1.A3:Sheet3.B9 , etc.)

, , etc.) the following unary operators :

: + and - for negative numbers: e.g. -3 or -(3+9)

and for numbers: e.g. or

% to denote percentage of numbers: e.g. 40-20% or SUM(A3:A9)%

to denote of numbers: e.g. or

! ( logical not ) for boolean values or sub expressions: e.g. !(A3 == 'foo')

( ) for boolean values or sub expressions: e.g. the following binary operators:

string concatenation : & (e.g. 'foo' & A4 )

: (e.g. )

arithmetic operators : + , - , * , / , ^ (e.g. 5*4 + 2*(9+3)

: , , , , (e.g.

comparison operators : < , <= , == or = , != or <> , >= , > (e.g. A3 > 5*4 )

: , , or , or , , (e.g. )

logical operators : && , ||

: , function calls (available functions)

(available functions) brackets for grouping of sub-expressions (e.g. (A1 + 4) * (A3 + 10)

The complete grammar for XLWrap expressions is available as part of the distribution.

Expressions may contain range references in order to obtain values from spreadsheets. The common syntax for range references is:

( ( [filename] "#$" )? [sheet1] "." )? [column1] [row1] ( ":" ( [sheet2] "." )? [column2] [row2] )?

File and sheet names have to be quoted if they contain spaces or special characters. Please note the optional parts inside of the brackets. There are actually different kinds of ranges references:

cell range: ( ( [filename] "#$" )? [sheet1] "." )? [column1] [row1]

Examples:

A4 file:foo.xls#$Sheet1.A4 'Sheet 1'.Z9

box range with obligatory second part: ":" ( [sheet2] "." )? [column2] [row2]

Examples:

A4:C9 file:foo.xls#$Sheet1.A4:Sheet3.B6

full sheet range: ( [filename] "#$" )? [sheet] ".*"

Examples:

Sheet1.* http://example.org/file.xls#$'sheet x'.*

the special any range construct: *.*

and multi ranges: multiple ranges separated by semicolons

Example:

A3; Sheet1.A3:B20; file:foo.xls#$Sheet1.*

Furthermore, it is possible to specify sheets by numbers instead of names with a # -prefix: e.g. #1.A3 refers to A3 on the first sheet of the current workbook (the current workbook is usually the base workbook specified for the map template, however, as a consequence of transform operations this can also be another workbook file).

Depending on the situation, specific kinds of ranges may be valid or not. For instance, specifying a box range as a single expression is not valid: A3:A9 is invalid. However, it is possible to specify a box range as an argument of the function SUM() : SUM(A3:A9; A10; 34; 10) .

In order to construct URI nodes, a blank node is used with the special property xl:uri , specifying the URI of the node. For example:

[ xl:uri "'http://example.org/' & URLENCODE(A2 & B2)"^^xl:Expr ] a foaf:Person .

will create a statement like <http://example.org/Tim+Smith> a foaf:Person at runtime.

Similarly, it is possible to link anonymous resources (blank nodes):

[] a ex:Revenue ; scv:dimension [ xl:id "A4"^^xl:Expr ; a ex:Product ; rdfs:label "A4"^^xl:Expr ] .

The XLWrap processor will ensure, that all resources with equal xl:id values will get equal blank node identifiers in the target graph. This feature is very important for more sophisticated mappings. It is also used for the SCOVO example above.

The function library of XLWrap will be continuously extended. We will provide detailed documentation of functions in future. For the meanwhile, please have a look at the source code of the core functions.

It is also possible to add custom functions. Each function is implemented by a Java class. The name of the class must adhere to the following pattern: E_Func[NAME] where NAME is the name of the function as it is used as part of expressions (e.g. NOW() is implemented by the class E_FuncNOW ). The location on the classpath is not important, you can place it anywhere.

Implement E_FuncYOURFUNCTION by extending the abstract class at.jku.xlwrap.map.expr.func.XLExprFunction . Apart from defining a default constructor (without arguments), the only method you need to implement is this one from the interface at.jku.xlwrap.map.expr.XLExpr :

public XLExprValue<?> eval(ExecutionContext context) throws XLWrapException, XLWrapEOFException;

Function arguments are already managed by the abstract super class. To access them, use getArg(int i) .

Be sure to call .eval(context) for any argument you wish to process. For example eval() of E_FuncTRIM looks like this:

@Override public XLExprValue<?> eval(ExecutionContext context) throws XLWrapException, XLWrapEOFException { XLExprValue<?> v1 = getArg(0).eval(context); if (v1 == null) return null; else return new E_String(TypeCast.toString(v1).trim()); }

Also be aware of null values returned by evaluated arguments! Typically you will return also null or throw an exception. If cell references refer to empty cells, a null value will be produced and chained through the complete expression. If an expression returns null in the end, XLWrap will skip the correpsonding statement.

Currently you have to call FunctionRegistry.registerPackage(String pkg) somewhere in your code, where pkg is the package name where E_FuncYOURFUNCTION is located. In the next release it will be possible to dynamically load function libraries with the property xl:functionLib specified for the xl:Mapping instance.

Compile your code and place it onto the classpath.

XLWrap is now able to parse expressions including your custom function.

XLWrap can be used either in-process via the Jena API or as a SPARQL endpoint via the bundled XLWrap-Server.

Using XLWrap via the API

The relevant classes to look for are:

MapParser , which can be used to parse TriG files into instances of XLWrapMapping , which is the representation used internally by XLWrap.

, which can be used to parse TriG files into instances of , which is the representation used internally by XLWrap. XLWrapMaterializer , which can be used to dump spreadsheets to RDF models.

Example:

XLWrapMapping map = MappingParser.parse("mappings/iswc09-example-scovo.trig"); XLWrapMaterializer mat = new XLWrapMaterializer(); Model m = mat.generateModel(map); m.write(new FileOutputStream("docs/website/example/revenues-scovo.n3"), "N3");

The materializer returns an ordinary Jena model. If you want to provide your own Jena model (possibly an inference model), you can explicitly provide a target model:

Model m = mat.generateModel(map, targetModel);

The materializer instance can be reused multiple times for different processes. It will keep any workbooks in memory for faster processing of multiple mappings with common spreadsheet files (especially when they have been downloaded from the Web).

Using XLWrap-Server

Just start the server from the XLWrap directory: $ bin/server and place mapping files into the folder mappings . The command takes the following arguments:

-?,--help help -B,--dataset-base <uri-prefix> Dataset base URI prefix for Pubby (see http://www4.wiwiss.fu-berlin.de/pubby/) -c,--config <port> Joseki config file -D,--desc <text> Dataset description -d,--data <port> RDF data (.n3, .ttl, .rdf, .xml) -H,--homepage <url> Homepage -h,--host <hostname> hostname (if it cannot be correctly determined by the JVM) -p,--port <port> port (default: 8900) -T,--title <text> Dataset title

If no argument is specified, the shell script automatically adds "-c joseki-xlwrap.ttl" to the call. The options -c and -p are probably the most important ones. Some meta data displayed at the web frontend and used by Pubby can be specified also.

Now open your browser and go to http://localhost:8900 (please adjust the port if needed).

Useful Functions to provide Linked Data

XLWrap provides useful functions for mappings: HOSTNAME() and PORT() that can be used to obtain the current hostname and port the server is running on. Another important function is MAKEURI() , which can be used to create local URIs in order to make the resources accessible via the integrated Linked Data Browser Pubby. Call MAKEURI() with either one or two String arguments. In the first case, the argument is URL-encoded, in the second case, the first argument is appended as-is and the second String argument is URL-encoded.

Additionally, we will provide functions in future to help users creating external links to other Linked Data sources such as DBpedia. We will integrate several record linkage algorithms and geo-mapping functions to create links to DBpedia, geonames.org, etc.

Sources: https://github.com/theandyl/xlwrap

Downloads: http://sourceforge.net/projects/xlwrap/files/xlwrap/

Please have a look at the example and download XLWrap. For details on starting and using XLWrap-Server proceed with the usage section. In order to start creating your own mappings read the mappings section.

XLWrap has been developed because there previously was no spreadsheet-to-RDF wrapper that supported cross tables. The following related projects are all based on a row-by-row wrapping process and plain text CSV files only:

RDF123 (best until now, supports template graphs, still row-oriented, CSV plain text only)

Excel2RDF (no mappings, very basic, CSV plain text only)

ConvertToRdf (basic mapping, fixed rdf:type and 1:1 mapping from column headings to properties, CSV only)

TopBraid Composer (basic mapping, CSV and MS Excel)

Andreas Langegger, Wolfram Wöß (2009): XLWrap – Querying and Integrating Arbitrary Spreadsheets with SPARQL. In Proceedings of the 8th International Semantic Web Conference (ISWC2009), Washington D.C. LNCS 5823, Springer, 2009.

Andreas Langegger, Wolfram Wöß (2009): Querying and Semantically Integrating Spreadsheet Collections with XLWrap-Server – Use Cases and Mapping Design Patterns. Poster & Demo Session at the 8th International Semantic Web Conference (ISWC2009), Washington D.C., 2009.

Please use the XLWrap Users mailing list to get support.

If you need professional support in your project, please contact me directly.

Contact: aka AndyL

Many thanks to Richard Cyganiak for contributing bug fixes and very good ideas for improvements and new features.

XLWrap has been developed as another wrapper for the Semantic Web Integrator and Query Engine (SemWIQ).

Thanks to SourceForge.net for providing the infrastructure.