HXQ: A Compiler from XQuery to Haskell

Download HXQ-0.19.0.tar.gz

Description

HXQ is a fast and space-efficient translator from XQuery (the standard query language for XML) to embedded Haskell code. The translation is based on Template Haskell. HXQ takes full advantage of Haskell's lazy evaluation to keep in memory only those parts of the XML data that are needed at each point of evaluation, thus performing stream-based evaluation for forward queries (queries that do not contain backward steps). This results in an implementation that is as fast and space-efficient as any stream-based implementation based on SAX filters or finite state machines. Furthermore, the code is far simpler and more extensible, since it is based on XML trees rather than SAX events. Since HXQ uses lazy evaluation, you get the first results of non-blocking queries immediately, while non-streaming XQuery processors must first parse the entire input file and construct the whole XML tree in memory before they produce any output.
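The effect of lazy evaluation on forward queries can be illustrated with a minimal sketch in plain Haskell. It does not use HXQ: the Tree type, the child and titles functions, and the bigDoc document are hypothetical names made up for this illustration. The document has an unbounded list of children, so it can never be built in full, yet a forward query over it still delivers its first results.

```haskell
-- A minimal rose tree for XML; an illustrative type, not HXQ's own XTree.
data Tree = Elem String [Tree] | Text String

-- Forward child step: the element children of a node with the given tag.
child :: String -> Tree -> [Tree]
child tag (Elem _ kids) = [ k | k@(Elem t _) <- kids, t == tag ]
child _   (Text _)      = []

-- Concatenated text content of a node.
text :: Tree -> String
text (Text s)      = s
text (Elem _ kids) = concatMap text kids

-- A hypothetical document with infinitely many <paper> children.
-- It can only be consumed lazily, never materialized as a whole.
bigDoc :: Tree
bigDoc = Elem "dblp"
  [ Elem "paper" [Elem "title" [Text ("Paper " ++ show n)]]
  | n <- [1 :: Integer ..] ]

-- Roughly the forward query paper/title/text() applied to the root;
-- each result is produced only when the consumer asks for it.
titles :: Tree -> [String]
titles doc = [ text t | p <- child "paper" doc, t <- child "title" p ]
```

In GHCi, take 3 (titles bigDoc) returns the first three titles and stops, because only the first three paper elements are ever constructed.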

Finally, HXQ can store XML documents in a relational database (currently MySQL or SQLite) by shredding XML into relational tuples and by translating XQueries over the shredded documents into optimized SQL queries. The mapping to relational tables is based on the document's structural summary, which is derived from the document data rather than from a schema. It uses hybrid inlining to inline attributes and non-repeating elements into a single table, thus resulting in a compact relational schema. For each such mapping, HXQ synthesizes an XQuery that reconstructs the original XML document from the shredded data. This XQuery is fused with the user queries using partial evaluation techniques, and parts of the resulting query are mapped to SQL queries using code folding rules so that all relevant predicates are promoted to SQL. This pushes most of the evaluation to the database query engine, thus resulting in fast execution over large data sets.
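As a rough illustration of a structural summary derived from data, the sketch below collects the distinct label paths of a document and separates the elements that repeat under a parent from those that do not. This is a simplified reading of the idea, not HXQ's actual algorithm: the Tree type, the function names, and the sample document are all assumptions made for the example, and attributes are ignored.

```haskell
import Data.List (group, nub, sort)

-- Illustrative rose tree; not HXQ's own representation.
data Tree = Elem String [Tree] | Text String

type Path = [String]

-- Structural summary: the distinct root-to-element label paths.
summary :: Tree -> [Path]
summary = nub . go []
  where
    go _      (Text _)        = []
    go prefix (Elem tag kids) =
      let here = prefix ++ [tag]
      in  here : concatMap (go here) kids

-- Paths of elements that occur more than once under some parent instance.
repeating :: Tree -> [Path]
repeating = nub . go []
  where
    go _      (Text _)        = []
    go prefix (Elem tag kids) =
      let here = prefix ++ [tag]
          tags = [ t | Elem t _ <- kids ]
          dups = [ here ++ [t] | (t:_:_) <- group (sort tags) ]
      in  dups ++ concatMap (go here) kids

-- Candidates for inlining into the parent's table: non-repeating paths.
inlinable :: Tree -> [Path]
inlinable t = [ p | p <- summary t, p `notElem` repeating t ]

-- A hypothetical sample document.
sample :: Tree
sample = Elem "dept"
  [ Elem "name" [Text "CS"]
  , Elem "student" [Elem "name" [Text "A"]]
  , Elem "student" [Elem "name" [Text "B"]]
  ]
```

For the sample, repeating yields only the path dept/student, so under this simplified rule student would get its own table while dept/name and dept/student/name would be inlined as columns of their parents.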

Performance

HXQ performs best when used for data-intensive applications. For example, the XQuery in tests/Test2.hs, which runs against the DBLP XML document (420MB), takes 36 seconds on my laptop PC and uses a maximum of 3.2MB of heap space (using the runtime options +RTS -H2m -M3.2m). (All results were taken on an Intel Core 2 Duo 2.2GHz with 2GB of memory, running ghc-6.8.3 on a 32-bit Linux 2.6.27 kernel.) In contrast, Qexo, which compiles XQueries to Java bytecode, takes 1 minute 17 seconds and uses 1400MB of heap space for the same query, while XQilla, which is written in C++, takes 1 minute 10 seconds and uses 1150MB of heap space. For simple XPath queries, the fastest implementation I have ever tried uses SAX pipelines; it runs in 17 seconds and needs 3MB of heap. Unfortunately, it is very hard to implement complex XQuery constructs using SAX, and one may end up simulating lazy evaluation using ad-hoc techniques.

For better performance in data intensive applications, one may use the database capabilities of HXQ. For example, when the DBLP file is shredded into a MySQL database and the appropriate index is created, the above query runs in 90 milliseconds.

HXQ uses the HXML parser for XML (developed by Joe English), which is included in the source. I have also tried hexpat, tagsoup, HXT, and HaXML Xtract, but they all have space leaks.

HXQ has two XML parsers: one that generates simple rose trees from XML documents, which can be processed by forward queries without space leaks, and another where each tree node has a reference to its parent. Some, but not all, backward axis steps (such as the parent axis /..) are removed from a query using optimization rules. If any backward axis steps are left in the query, HXQ uses the latter parser, which may result in a performance penalty due to space leaks.
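One way to picture the second kind of tree is the sketch below, which adds a parent reference to every node by building the tree lazily ("tying the knot"). It is an illustration under assumptions, not HXQ's parser: the Tree and PNode types and the link and parentStep functions are hypothetical names.

```haskell
-- Plain rose tree, as produced by a simple forward-only parser (illustrative).
data Tree = Elem String [Tree] | Text String

-- A node that also knows its parent (hypothetical type for this sketch).
data PNode = PNode
  { label    :: String        -- tag name, or the text of a text node
  , parent   :: Maybe PNode   -- Nothing for the root
  , children :: [PNode]
  }

-- Build the parent-linked tree; each child refers back to the node
-- that is being defined, which works because of lazy evaluation.
link :: Maybe PNode -> Tree -> PNode
link p (Text s)        = PNode s p []
link p (Elem tag kids) = node
  where node = PNode tag p (map (link (Just node)) kids)

-- Backward step "..": the parent of each node, where one exists.
parentStep :: [PNode] -> [PNode]
parentStep ns = [ q | Just q <- map parent ns ]
```

With such a structure, any node that is still referenced keeps its parent reachable, and through it the rest of the tree, so the garbage collector cannot discard the parts of the document that have already been processed. That is a plausible reason why the parent-linked parser costs more memory than the plain one.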

Installation Instructions (HXQ without Database Connectivity)

HXQ can be installed on most platforms, but I have only tested it on Linux, Mac OS X, and Windows XP. The simplest installation is without database connectivity (i.e., it can only process XQueries against XML text documents). If you want database connectivity (over MySQL or SQLite relational databases), look at the installation instructions for database connectivity.

First, you need to install the Glasgow Haskell Compiler, ghc. Optionally, if you want to modify the XQuery parser, you need to install the parser generator for Haskell, happy. If you are new to Haskell, please read How to install a Cabal package. The easiest way to install packages in Haskell is using cabal. On Linux, you can install Haskell and cabal using yum install ghc happy cabal-install . You must then update the list of known packages using cabal update .

The simplest way to install the HXQ library is by using the cabal command:

cabal install HXQ

(If you use the old base-3 GHC library, pass the option -fbase3 to cabal.) Then, to compile the xquery command-line interpreter, download xquery.hs and run:

ghc --make xquery.hs -o xquery

tar xfz

runhaskell Setup.lhs configure --user
runhaskell Setup.lhs build
runhaskell Setup.lhs install

HXQ consists of the executable xquery , which is the XQuery command line interpreter, and the HXQ library.

To use the HXQ library in a Haskell program, simply import Text.XML.HXQ.XQuery .

Current Status

HXQ supports most essential XQuery features, although some system functions are missing (but are easy to add). Note that HXQ is a proof-of-concept (prototype) implementation; it is not fully compliant with the W3C specs. One may use HXQ as a basis for a fully compliant XQuery implementation (conforming to the W3C test suites), but currently I do not have the time to do so. To see the list of supported system functions, run xquery -help. Here are some important differences between HXQ and the W3C specs:

Currently, all namespaces in HXQ XQueries must be defined using import schema or declare namespace. Although HXQ recognizes xmlns: attributes in XML files and XQuery constructions, these namespaces are not imported.

The XQuery semantics requires duplicate elimination and sorting by document order for every XPath step, which is very expensive and unnecessary in most cases. This is not currently supported by HXQ but will be addressed in the future (it needs a static analysis to determine when duplicate elimination is necessary). For example, e//*//* may return duplicate elements in HXQ.

Attributes in constructed elements must be embedded in the start-tag and/or, if the element content is a sequence, appear at the beginning of the sequence as constructed attributes.
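The duplicate-elimination difference listed above can be reproduced with a small sketch in plain Haskell. It does not use HXQ; the Tree type, the descendants and twoSteps functions, and the chain document are assumptions for illustration. Each //* step is modeled as "all descendant elements", and the two steps are composed without removing duplicates.

```haskell
import Data.List (nub)

-- Illustrative rose tree; not HXQ's own representation.
data Tree = Elem String [Tree] | Text String

-- Element children of a node.
kidsOf :: Tree -> [Tree]
kidsOf (Elem _ ks) = [ k | k@(Elem _ _) <- ks ]
kidsOf (Text _)    = []

-- All descendant elements of a node, in document order (one //* step).
descendants :: Tree -> [Tree]
descendants t = concat [ k : descendants k | k <- kidsOf t ]

-- Tag name of an element (empty for text nodes).
name :: Tree -> String
name (Elem t _) = t
name (Text _)   = ""

-- e//*//* with no duplicate elimination between or after the steps.
twoSteps :: Tree -> [String]
twoSteps e = [ name d2 | d1 <- descendants e, d2 <- descendants d1 ]

-- A hypothetical document: <a><b><c><d/></c></b></a>
chain :: Tree
chain = Elem "a" [Elem "b" [Elem "c" [Elem "d" []]]]
```

Here twoSteps chain evaluates to ["c","d","d"]: the element d is reached once through b and once through c. Applying nub gives ["c","d"], but only because the tag names in this sample are unique; a real implementation would have to compare node identities and restore document order, which is the expensive part.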

XQuery Documentation

The complete XQuery syntax in HXQ is described in hxq-manual.pdf. I have also written a paper that describes some of the database-related methods used in the implementation. Here are some tutorials on XPath and XQuery. Here are two relevant courses on XML and databases at Stanford and ETH.

Using the Compiler

The main functions for embedding XQueries in Haskell are:

$(xe query) :: XSeq

$(xq query) :: IO XSeq

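Here query is a string that contains the XQuery. The type XSeq is a sequence of XML trees, [XTree]; a query built with xq returns its result as an IO action (IO XSeq). A Haskell variable v can be referenced inside a query as $v, provided that v has type XSeq. A query can also call a Haskell function of type (XSeq,...,XSeq) -> IO XSeq.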

Here is an example of a main program:

f(x,y) = $(xq "<article><first>{$x}</first><second>{$y}</second></article>")

main = do a <- $(xq "<result>{ \
                    \ for $x at $i in doc('data/dblp.xml')//inproceedings \
                    \ where $x/author = 'Leonidas Fegaras' \
                    \ order by $x/year descending \
                    \ return <paper>{ $i, ') ', $x/booktitle/text(), \
                    \                 ': ', $x/title/text() \
                    \        }</paper> \
                    \ }</result> ")
          putXSeq a
          b <- $(xq " f( $a/paper[10], $a/paper[8] ) ")
          putXSeq b

ghc -O2 --make tests/Test1.hs -o a.out

You can compile an XQuery file into a Haskell program ( Temp.hs ) using xquery -c file . Or better, you can use the script compile (on Unix/Mac or Windows) to compile the XQuery file to an executable. For example:

compile data/q1.xq

a.out

Using the Interpreter

The HXQ interpreter is far slower than the compiler; use it only if you need to evaluate ad-hoc XQueries read from input or from files. The only function is:

xquery :: String -> IO XSeq

xquery

xquery data/q1.xq

xquery -p xpath-query xml-file

xquery -p "//inproceedings[100]" data/dblp.xml

xquery -help

XML Schema Validation and Type Inference

Currently, HXQ supports type testing and casting using the XQuery expressions: typeswitch, instance-of, cast-as, etc. The validation and type inference systems are still a work in progress. To use type inference, use the option -tp in xquery . To associate an XML document with an XML Schema, use the XQuery import schema statement. For example:

import schema default element namespace "dept" at "data/department.xsd";
validate {doc("data/cs.xml")//gradstudent};
(doc("data/cs.xml")//gradstudent[.//lastname='Galanis']//address) instance of element(address)*

validateFile

validateFile "data/dblp.xml" "data/dblp.xsd"

Last modified: 01/08/10 by Leonidas Fegaras