ijson

The usual solution to parsing JSON in Python is to use either simplejson (included in the standard library as "json" since Python 2.6) or the recently popular third-party cjson. Both libraries process JSON in one piece: they parse the whole thing and return a native Python object. However, there is real value in processing JSON as a stream, SAX-style, when the size of the payload starts to be counted in megabytes and you don't actually need to hold the whole object in memory.

It turns out that this problem is already solved by the yajl library (hat tip to Alex for pointing me to it), which comes with two different Python bindings. I ended up not liking either of them, so I made my own — ijson.

The name follows the same principle as the functions from itertools: "imap", "ifilter", etc.

Existing solutions

py-yajl was a non-starter because its interface consists of "dumps" and "loads" functions and effectively hides the streaming nature of yajl.

Another one — yajl-py — looked more interesting. Using it is much like using a SAX parser: you define a handler class and pass it to the parser, which then calls the methods of your class responsible for handling individual events parsed from the input stream. The problem with this approach is that the call to the parser is synchronous: it reads the whole input by itself, and there is no convenient way to pause and restart the process. What I needed was essentially a Python iterator from which I could read events in my own loop. With such an iterator I can wrap it in whatever processing I need and return it from my application to a WSGI server.
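For contrast, here is a toy sketch of the two styles. The mini "parsers" below are made up for illustration and are not yajl-py's actual API:

```python
# Callback (SAX) style: the parser drives the whole read and fires
# handler methods; you cannot pause it between events.
class Handler:
    def __init__(self):
        self.numbers = []

    def number(self, value):
        self.numbers.append(value)

def push_parse(tokens, handler):
    for kind, value in tokens:  # reads the entire input in one call
        if kind == 'number':
            handler.number(value)

# Iterator style: the parser yields one event at a time, so the loop
# can live anywhere — e.g. inside a WSGI response generator.
def pull_parse(tokens):
    for kind, value in tokens:
        yield kind, value

tokens = [('number', 1), ('number', 2)]
handler = Handler()
push_parse(tokens, handler)   # handler.numbers == [1, 2]
numbers = [v for k, v in pull_parse(tokens) if k == 'number']
```

The second style is what ijson provides: the consumer controls the loop instead of the parser.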

ijson

I reckoned that creating such an iterator shouldn't be hard, so I refactored the core yajl-py function containing the parsing loop into a generator. Then I looked around, refactored some more things, and then some more again. As a result, my current parse.py bears almost no resemblance to the original yajl_parse.py :-). Nonetheless I am very grateful to Hatem Nassrat — the author of yajl-py — because if it weren't for his code I'd never have gotten around to writing a wrapper for a C library at all.

So now I had an iterator, basic_parse(file), which generates pairs of the form (event, value) corresponding to what the C library returns (plus conversion to native Python types). However, I quickly realized that such events are hard to process because they provide no context: you get this ('map_key', 'name') thing but have no idea which object the key lives in. So I wrapped this basic parser in another one that maintains the context. Instead of pairs it generates triples with an additional first value — a path showing where in the object tree the event occurred.

So given this document:

{ "array": [1, 2], "map": { "key": "value" } }

... the parser yields this sequence of events:

    ('', 'start_map', None)
    ('', 'map_key', 'array')
    ('array', 'start_array', None)
    ('array.item', 'number', 1)
    ('array.item', 'number', 2)
    ('array', 'end_array', None)
    ('', 'map_key', 'map')
    ('map', 'start_map', None)
    ('map', 'map_key', 'key')
    ('map.key', 'string', u'value')
    ('map', 'end_map', None)
    ('', 'end_map', None)

Having the path prefixes (the first item of each tuple) gets rid of the complex state-keeping logic in user code and lets you react only to events under certain keys or subtrees.
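To illustrate, here is a self-contained sketch: a hand-written list of (prefix, event, value) triples mimicking the parser's output, filtered with a plain comprehension instead of a state machine:

```python
# A hand-written event stream in the (prefix, event, value) form
# described above (no actual parsing here, just illustration).
events = [
    ('', 'start_map', None),
    ('', 'map_key', 'array'),
    ('array', 'start_array', None),
    ('array.item', 'number', 1),
    ('array.item', 'number', 2),
    ('array', 'end_array', None),
    ('', 'end_map', None),
]

# React only to numbers inside the "array" key — no state tracking.
numbers = [value for prefix, event, value in events
           if prefix == 'array.item' and event == 'number']
print(numbers)  # -> [1, 2]
```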

And just yesterday I added another iterator, items(file, prefix), which instead of yielding raw events yields native Python objects found under the specified prefix:

    from ijson import items

    for item in items(file, 'docs.item.name'):
        do_something_with(item)

This is obviously the most convenient way to handle parsing, but it's bound to be a bit slower due to the repeated creation of temporary objects.

Processing with coroutines

The classic way to process a streaming parse is to have a class consisting of callbacks that are called on individual events. All the necessary state is stored in an instance of that class. However, this approach is known to be not exactly readable. It also doesn't work at all if your processing code wants to use "with" statements, because you can't wrap several callback calls into one "with" context.

Currently at work I'm busy generating XML data from JSON data. For this I use my own elementflow, which doesn't require intermediate objects to produce XML serialization. So the obvious idea was to combine the event-based parser and the event-based generator into a fast, memory-efficient pipeline. The only problem is that elementflow happens to rely heavily on that very "with" statement.

Meet David Beazley's coroutines (fanfares)!

The basic idea is that instead of a class with callbacks you use a Python generator that can accept values through its .send() method. If you haven't seen that presentation, I urge you to watch it right now, because it explains the concept very well and I have no intention of retelling it in my own words.
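The trick hinges on a small decorator that "primes" a generator by advancing it to its first yield, so it is ready to accept .send() immediately. Below is the common formulation from Beazley's talk; ijson.utils ships its own version, which may differ in detail:

```python
def coroutine(func):
    """Advance a freshly created generator to its first `yield`
    so callers can .send() into it right away."""
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # prime: run up to the first `yield`
        return gen
    return wrapper

@coroutine
def collector(result):
    # A trivial consumer: appends everything it receives to `result`
    while True:
        item = yield
        result.append(item)

received = []
c = collector(received)
c.send('hello')
c.send('world')
# received == ['hello', 'world']
```

Without the priming step, the first .send() on a brand-new generator would raise a TypeError.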

With such generators using with is perfectly possible since you have only one function that accepts all the events:

    @coroutine
    def converter(xml):
        while True:
            prefix, event, value = yield  # get another event from the parser
            if prefix == 'rows.item.id':
                # store the value in a local var
                id = value
            elif prefix == 'rows.item.value':
                # create another coroutine that knows how to process
                # 'rows.item.value' contents
                target = value_coroutine(xml)
                # generate a containing XML element
                with xml.container('value', {'id': id}):
                    # secondary loop consuming 'rows.item.value'
                    while (prefix, event) != ('rows.item.value', 'end_map'):
                        prefix, event, value = yield
                        # offload events into the target
                        target.send((prefix, event, value))

After coding in this style for a while I noticed several repeating patterns and extracted a couple of them into utility functions, now available in ijson.utils:

foreach(coroutine_func). Intended to handle a JSON array: it creates a new instance of coroutine_func for each array item and feeds it the contents of that item.

dispatcher([(prefix, target), (prefix, target), ...]). Accepts a sequence of pairs of prefixes and their handler coroutines. It then feeds each handler all the events found under its prefix.
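The real implementations live in ijson.utils; the sketch below is only meant to convey the idea behind dispatcher (the prefix matching and coroutine plumbing here are my assumptions, not the library's exact code):

```python
def coroutine(func):
    # the usual priming decorator: advance to the first `yield`
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return wrapper

@coroutine
def dispatcher(targets):
    # targets: sequence of (prefix, coroutine) pairs; every incoming
    # event is forwarded to each coroutine whose prefix it falls under
    while True:
        prefix, event, value = yield
        for target_prefix, target in targets:
            if prefix.startswith(target_prefix):
                target.send((prefix, event, value))

@coroutine
def collect(into):
    # toy leaf handler: just record the events it receives
    while True:
        into.append((yield))

ids, names = [], []
d = dispatcher([('rows.item.id', collect(ids)),
                ('rows.item.name', collect(names))])
d.send(('rows.item.id', 'number', 1))
d.send(('rows.item.name', 'string', 'spam'))
# ids  == [('rows.item.id', 'number', 1)]
```

Each leaf handler sees only its own slice of the event stream, which is what lets the handlers stay small.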

This way the code looks like a tree of dispatchers whose leaves are generators of simple XML elements. The dispatchers are composed according to the structure of the input JSON tree. It looks a bit unusual, but quite readable to my eye.

The final step is to arrange for parsing and XML generation to happen simultaneously. This is done with a conventional generator using an in-memory queue. The generator is then passed straight to a WSGI server:

    def generator(file):
        queue = elementflow.Queue()
        with elementflow.xml(queue, 'root') as xml:
            g = consumer_coroutine(xml)
            for event in ijson.parse(file):
                g.send(event)
                if len(queue) > BUF_SIZE:
                    yield queue.pop()
        yield queue.pop()

Is it worth it

In terms of efficiency, the short answer is yes. On a real task of producing two megabytes of XML out of a similar amount of JSON it performs about 30% faster than a cjson variant, even running under an ordinary forking Django server (meaning we don't get the memory savings). I don't have any hard numbers at hand though, and this post is already rather long, so I plan to write another one about testing this thing.

In terms of code size and complexity it's all highly subjective. In any case, be prepared that streaming code will always be a bit bigger and hairier than code dealing with whole objects.

An exercise: ObjectBuilder

Want some fun? :-)

Try to write code that produces a native object out of an event sequence. This is what every non-streaming parser effectively does under the hood. ijson already has this code — the "ObjectBuilder" class in the "parse" module — but I suggest you don't look at it just yet. The code, though compact, turned out to be quite sophisticated. It would be nice if someone could come up with a simpler solution, and in that case looking at the existing code might get in the way.