Erlsom is an Erlang library to parse (and generate) XML documents.

Erlsom can be used in three very different modes:

As a SAX parser. This is a more or less standardized model (see http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html) for parsing XML. Every time the parser has processed a meaningful part of the XML document (such as a start tag), it reports this to your application as an event. The application can process the event (potentially in parallel) while the parser continues with the rest of the document. The SAX parser lets you parse XML documents of arbitrary size efficiently, but it may take some time to get used to. If you invest some effort, you may find that it fits very well with the Erlang programming model (personally I have always been very happy with my choice to use a SAX parser as the basis for the rest of Erlsom).
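As an illustration of the event model, here is a minimal sketch that counts the start tags with a given name. The module name `sax_count` and the function `count_elements/2` are made up for this example; it assumes that Erlsom is on the code path and that `erlsom:parse_sax/3` calls the event function with `{startElement, Uri, LocalName, Prefix, Attributes}` tuples and returns `{ok, Acc, Rest}`.

```erlang
-module(sax_count).
-export([count_elements/2]).

%% Count the start tags whose local name equals Name (a string).
%% The accumulator is a plain integer that is threaded through all events.
count_elements(Xml, Name) ->
    EventFun = fun({startElement, _Uri, LocalName, _Prefix, _Attrs}, Count)
                     when LocalName =:= Name ->
                       Count + 1;
                  (_OtherEvent, Count) ->
                       %% All other events leave the accumulator unchanged.
                       Count
               end,
    {ok, Count, _Rest} = erlsom:parse_sax(Xml, 0, EventFun),
    Count.
```

For example, `sax_count:count_elements(<<"<foo><bar>x</bar><bar>y</bar></foo>">>, "bar")` should count the two `bar` elements. Because the event function only keeps a counter, memory use does not grow with the size of the document.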

As a simple sort of DOM parser. Erlsom can translate your XML to the ‘simple form’ that is used by Xmerl. This is a form that is easy to understand, but you have to search your way through the output to get to the information that you need.
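A small sketch of what "searching your way through the output" can look like. The module `simple_form_demo` and the function `child_texts/2` are hypothetical; the sketch assumes that `erlsom:simple_form/1` returns `{ok, {Tag, Attributes, Content}, Rest}`, with tag names as strings and the content as a list of child elements and strings.

```erlang
-module(simple_form_demo).
-export([child_texts/2]).

%% Return the text of every direct child of the root element whose tag
%% equals ChildName and whose content is a single piece of text.
child_texts(Xml, ChildName) ->
    {ok, {_RootTag, _RootAttrs, Children}, _Rest} = erlsom:simple_form(Xml),
    %% Children that are not {Tag, Attrs, [Text]} tuples (for example
    %% plain character data) do not match the pattern and are skipped.
    [Text || {Tag, _Attrs, [Text]} <- Children, Tag =:= ChildName].
```

With the input `<<"<foo attr=\"baz\"><bar>x</bar><bar>y</bar></foo>">>` and `ChildName = "bar"`, this should pick out the texts of the two `bar` children.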

As a ‘data binder’. Erlsom can translate the XML document to an Erlang data structure that corresponds to an XML Schema. Compared with the SAX parser, this has the advantage that the XML document is validated and that you know exactly what the layout of the output will be. This makes it easy to access the elements that you need in a very direct way. (See http://www.rpbourret.com/xml/XMLDataBinding.htm for a general description of XML data binding.)
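The data binder works in two steps: compile the XML Schema into a model, then parse documents against that model. The following is a sketch only; the module `binder_demo`, the function `parse_with_schema/2` and any file path you pass in are placeholders, and it assumes that `erlsom:compile_xsd_file/1` returns `{ok, Model}` and that `erlsom:scan/2` returns `{ok, Struct, Rest}` or `{error, Reason}`.

```erlang
-module(binder_demo).
-export([parse_with_schema/2]).

%% XsdPath: path to an XML Schema file (placeholder, supply your own).
%% Xml:     the XML document, as a binary or a string.
parse_with_schema(XsdPath, Xml) ->
    %% Step 1: translate the schema into Erlsom's internal model.
    {ok, Model} = erlsom:compile_xsd_file(XsdPath),
    %% Step 2: parse and validate the document against the model.
    case erlsom:scan(Xml, Model) of
        {ok, Struct, _Rest} -> {ok, Struct};
        {error, Reason}     -> {error, Reason}
    end.
```

If the same schema is used for many documents, it is probably better to compile the model once and pass it around, rather than compiling it on every call as this sketch does.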

If the document is too big to fit into memory, or if the document arrives in some kind of data stream, it can be passed to the parser in blocks of arbitrary size.
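A sketch of block-wise parsing with the SAX parser, under the assumption that `erlsom:parse_sax/4` accepts an option of the form `{continuation_function, Fun, State}`, where `Fun(Tail, State)` returns `{NewData, NewState}` each time the parser runs out of input. Please check the reference documentation for the exact form of this option. The module `chunked_sax` and the function `count_start_tags/1` are made up; the blocks come from a list of binaries here, but they could just as well be read from a file or a socket.

```erlang
-module(chunked_sax).
-export([count_start_tags/1]).

%% Chunks: a list of binaries that together form one XML document.
count_start_tags(Chunks) ->
    EventFun = fun({startElement, _, _, _, _}, N) -> N + 1;
                  (_OtherEvent, N) -> N
               end,
    %% Called when the parser needs more data. Tail is the part of the
    %% previous block that has not been consumed yet, so the next block
    %% is appended to it.
    Continue = fun(Tail, [Next | Rest]) -> {<<Tail/binary, Next/binary>>, Rest};
                  (Tail, [])            -> {Tail, []}   % no more input
               end,
    {ok, Count, _Rest} =
        erlsom:parse_sax(<<>>, 0, EventFun,
                         [{continuation_function, Continue, Chunks}]),
    Count.
```

The blocks may be split at arbitrary positions, for example `chunked_sax:count_start_tags([<<"<foo><b">>, <<"ar>x</bar></foo>">>])`.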

The parser can work directly on binaries. There is no need to transform binaries to lists before passing the data to Erlsom. Using binaries as input has a positive effect on the memory usage and on the speed (provided that you are using Erlang R12B or later; on an older Erlang version the parser will be faster if you transform binaries to lists first). The binaries can be Latin-1, UTF-8 or UTF-16 encoded.

The parser has an option to produce output in binary form (only the character data: names of elements and attributes are always strings). This may be convenient if you want to minimize the memory usage, and/or if you need the result in binary format for further processing. Note that it will slow down the parser slightly. If you select this option the encoding of the result will be utf-8 (irrespective of the encoding of the input document).
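A minimal sketch of the binary output option, assuming that it is selected with `{output_encoding, utf8}` in the options list and that `erlsom:simple_form/2` accepts it (check the reference documentation for the exact option name). The module `binary_output_demo` and the function `root_text/1` are hypothetical.

```erlang
-module(binary_output_demo).
-export([root_text/1]).

%% Return the content of the root element, with character data as
%% UTF-8 encoded binaries. Element and attribute names stay strings.
root_text(Xml) ->
    {ok, {_Tag, _Attrs, Content}, _Rest} =
        erlsom:simple_form(Xml, [{output_encoding, utf8}]),
    Content.
```

For an input such as `<<"<greeting>hello</greeting>">>` the character data in the returned content should then be a binary rather than a string.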

Read the documentation