[Warning: this post got huge and it’s not even halfway done. I’ll post the second half (which justifies the title) in the coming days]

Last winter I began a project that has often kept me awake ever since abandoning App Engine: taking an ordered-map key/value store and turning it into something you simply don’t need to think about while writing software. By “don’t need to think about”, I mean that the impedance mismatch between the store and Python is minimized, but not so far as to require a poison worse than the original affliction.

My experiments aren’t quite at the point of bearing fruit; however, some ideas are beginning to materialize in the form of the py-lmdb (docs) and Acid (docs) packages.

Missing from those docs is the bigger picture of designing efficient applications, even for data sets larger than a single host. That is a more complex story for another time, alongside some novel components that do not yet exist.

The headline is simply that contemporary application design is a freaking joke and we are all complicit. There is no reason a stock Django programmer should be producing software that requires a $15k/month EC2 spend just to hit 1,000 requests/sec at 40ms latency, much less be proud of it. If you were somehow involved in establishing this norm, expect no less than abject scorn from across the table should we ever meet.

In truth, contemporary web software efficiency and operational costs are easily 10x-100x off where they should be, with a marked decrease in complexity on offer, if only a little more attention were paid to the solutions we advocate. And the answer is emphatically not “nginx/gunicorn/CrapwareDB will save your soul!”

Even a modest inquiry into the implementation of modern operating systems would unmask the treachery of statements like these, but that bridge is still a long way off. The essence is that before we can succeed in creating something big, we must grasp the art of constructing the small.

To reimagine the contemporary web stack as a web stack will require a number of steps, and there are many forgotten assumptions we must first revisit..

My mental task list continues to churn as I glacially coax Acid toward eventual practicality. Recently I’ve been focused on fixing the horribly slow, broken iterator implementation, taking it from the wiry ad-hoc mess it is now into a cleaner, more maintainable and efficient future, while remaining cognizant of how the result will behave in pure-Python environments like IronPython and PyPy.

The most visible symptom of the current implementation is that lookups are an order of magnitude slower than the underlying storage engines allow. While already an order of magnitude faster than querying an external DBMS, that is nowhere near what is possible, and I didn’t come in search of a mediocre result.

And so the can of worms is laid bare..

Iterator implementation

The iterator’s goal is conceptually simple: return only records matching the user’s range criteria, where the criteria might yield a single record. However, in this implementation the record may exist as part of a compression batch, and so the first complexity is already encountered: the ‘iterator’ really wants to be two distinct iterators. One is an inner iterator that knows how to walk the contents of a batch while respecting the user’s range criteria; the other is an outer iterator that manages a raw storage engine iterator, again paying attention to the user’s range requirements.

The outer iterator works in the bytestring domain: given a start key and scan direction, it ensures returned physical keys are within the absolute range of the collection (by testing for a bytestring prefix), feeding any legitimate physical records to the logical iterator.

Both must account for the possibility that the first record returned by the engine exceeds either the collection’s range or the user’s range, and also that the user’s range may begin inside a batch record, in which case only a subset of that batch should be yielded. Similarly, the termination criterion might fall within a batch, so it is not enough to check for termination in the physical iterator; the logical iterator must implement the check too.
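A minimal sketch of that two-level shape (all names here are hypothetical, and the `decode_batch` hook stands in for real batch decompression; actual code must also handle reverse iteration and open-ended ranges):

```python
import bisect

def logical_iter(batch_keys, batch_values, lo, hi):
    # Inner iterator: walk one decoded batch, honouring the user's range,
    # since a batch may begin before `lo` or end after `hi`.
    start = bisect.bisect_left(batch_keys, lo)
    for key, value in zip(batch_keys[start:], batch_values[start:]):
        if key > hi:
            return  # termination criterion fell inside the batch
        yield key, value

def physical_iter(engine_items, prefix, lo, hi, decode_batch):
    # Outer iterator: walk raw engine records in the bytestring domain,
    # testing the collection prefix, and feed each batch inward.
    for phys_key, phys_value in engine_items:
        if not phys_key.startswith(prefix):
            return  # walked past the collection's key space
        keys, values = decode_batch(phys_value)
        for item in logical_iter(keys, values, lo, hi):
            yield item
```

Note how the range check appears twice: once on physical keys via the prefix, and again on logical keys inside each batch.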

The story is further complicated by what are currently two, but will ultimately be at least three, iteration modes, which for the sake of maintainability may need even further subdivision:

Index iteration, where keys are length-2 lists containing the index tuple and the target key tuple.

Iteration of “impure” batched collections, where compression may be used but item keys must all be saved, since the key function is not derived from the item’s value (for example with autoincrementing integers).

Iteration in “pure” collections, where the item key is based entirely on the item data, or from item data and some recoverable context, and so only the lowest and highest key from a batch need be preserved. Given the 2KB syslog example below, this produces a key that is 19 bytes instead of 901 bytes.
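For the pure case, a sketch of the idea (names hypothetical, with the batch index kept as a plain sorted list rather than the real storage engine): only a batch’s lowest and highest logical keys are preserved, and a lookup locates the batch whose range covers the wanted key:

```python
import bisect

def batch_index_key(keys):
    # In a 'pure' collection every key is recoverable by decoding the
    # batch itself, so only the first and last key need be preserved.
    return (keys[0], keys[-1])

def find_batch(index, key):
    # `index` is a sorted list of (lo, hi) pairs, one per batch.  Return
    # the position of the batch whose [lo, hi] covers `key`, or None.
    i = bisect.bisect_right([lo for lo, _ in index], key) - 1
    if i >= 0 and index[i][0] <= key <= index[i][1]:
        return i
    return None
```

A hit then costs one batch decompression; a miss is decided without touching any batch contents.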

None of these requirements are particularly difficult to express to a machine, except that they must all be accounted for even when fetching a single record, that the code should mostly be written in Python (perhaps with the help of some C speedups), and that there is no reason the resulting efficiency should not approach that of the storage engine.

Current implementation

The iterator implementation sucks firstly because it constantly thinks about database keys in two domains: their physical bytestring representation and the user-visible tuple representation, which may itself be part of a list-of-tuples representation. A smorgasbord of partial functions and itertools abuse is employed to handle all possible option combinations, some relying on tuples and others on bytestrings.

Relying on the tuple representation forces us to decode the physical bytestring. This seemingly innocuous operation hides a woeful truth: for each key encountered, including those not matching the user’s criteria, we must produce a tuple containing a sequence of integers, strings and suchlike, all of which must be heap-allocated, and in the case of strings, copied and decoded out of a null-safe form. During decoding the system allocator can (and often will) move partially decoded strings as they change storage class, depending on their size.

In the case of UUIDs we must construct a dict of keyword arguments and invoke a 50-line Python constructor, initializing an instance and its corresponding instance dict, all so that nanoseconds later the result can be discarded.

Taking UNIX syslog as an example, around 100 zlib-compressed syslog entries fit in a batch of under 2,000 bytes, which is an interesting size since two batches will snugly fit a database page, and the batch remains large enough to achieve almost 5.5x compression while small enough to be cheaply decompressed. Given a machine with 32GB of RAM, we can serve 192GB of logs without a single seek, at much higher rates than possible with an external DBMS running on a host with 192GB of RAM.
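The arithmetic is easy to reproduce. A sketch using synthetic sshd-style lines (an assumed format; real syslog traffic is messier, so the exact ratio will vary):

```python
import zlib

# 100 synthetic syslog-ish entries; only timestamps, addresses and
# ports vary, much like a real burst of sshd chatter.
lines = [
    'Jun  7 03:12:%02d host sshd[2410]: Accepted publickey for user '
    'from 192.0.2.%d port %d ssh2' % (i % 60, i % 254 + 1, 40000 + i)
    for i in range(100)
]
raw = '\n'.join(lines).encode()
batch = zlib.compress(raw)
ratio = len(raw) / float(len(batch))
```

On data this repetitive the ratio is high; genuinely mixed log traffic compresses less well, which is where the quoted 5.5x comes from.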

But this is a distraction. The point is to consider a random lookup in this collection, each entry identified by a 3-tuple of integers: to access a single record we must potentially allocate and discard 400 heap objects.

Clearly, fixing iteration must first involve addressing this representation madness, and the fabled utopia of pure-Python types everywhere simply might not make the cut.

Key Representation

Tuples were an obvious conceptual choice and a massive improvement over App Engine’s Unicode strings, however the Python tuple type itself may be too inflexible to meet the efficiency goal. Even if tuple production could be made free, comparisons would still be slower than simple string compares.

The reason tuple comparison is needed is the key decoder, which is currently implemented using unpacks(), an all-or-nothing function that produces a list of tuples from a bytestring. The main motivation for this was simplicity: instead of incrementally calling into the decoder as more results are needed, a simple boundary exists between the decoder and the surrounding code.

What is really needed is either a key decoder that works incrementally on bytestrings, or a tuple-like type backed by bytestrings. With just an incremental decoder, either a custom interface must be provided and hard-wired into both the library and any consumer code wishing for the speedup, or comparison will continue to rely on tuples.

Clearly then, it seems the way forward is a new type backed by the bytestring representation. The type should look, squeak and bark like a tuple, but its logical manifestation should be computed lazily, and its __cmp__ implementation should require no decoding at all.
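A sketch of such a type, with rich comparison methods standing in for __cmp__, and a stand-in decode_key that assumes fixed-width big-endian integer keys, whose bytewise order happens to match their numeric order (the real key encoding is order-preserving by design):

```python
import struct

def decode_key(packed):
    # Stand-in decoder: assume fixed-width big-endian unsigned ints,
    # which sort bytewise in the same order as their numeric values.
    return struct.unpack('>%dI' % (len(packed) // 4), packed)

class Key(object):
    """Looks, squeaks and barks like a tuple, but its logical form is
    computed lazily; comparison touches only the raw bytes."""
    __slots__ = ('packed', '_tup')

    def __init__(self, packed):
        self.packed = packed
        self._tup = None

    def _materialize(self):
        if self._tup is None:
            self._tup = decode_key(self.packed)
        return self._tup

    def __eq__(self, other):
        return self.packed == other.packed   # no decoding required

    def __lt__(self, other):
        return self.packed < other.packed    # bytewise == logical order

    def __getitem__(self, i):
        return self._materialize()[i]        # decode on first real use

    def __len__(self):
        return len(self._materialize())
```

Slicing, hashing and the rest of the tuple protocol would follow the same pattern; the crucial property is that a range scan can compare thousands of keys without materializing a single tuple.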

So despite the aesthetic devilishness of introducing a type that mimics something built into the language, it’s all decided and we’re good to go. Right? Well.. there’s still that issue of the backing bytestring, and how for batch records it is actually a substring of a larger string.

Oh and about that larger string, with py-lmdb it too is actually part of a much larger string, which for the time being we’ll simply call virtual memory. How the plot thickens.. it’s strings all the way down!

[To be continued]

Comments on a postcard to @edeadlk.