May 27, 2014

nullprogram.com/blog/2014/05/27/

Emacs Lisp strings are mutable, fixed-length arrays of characters (multibyte) or bytes (unibyte). Any operation that would change a string’s length requires allocating a new string object. This is common to strings in many programming languages. Python, Java, and JavaScript go even further, with strings being completely immutable.

In these languages, performing many string operations at a time, especially with the += operator, allocates many temporary strings. It’s also awkward. For these situations, Java provides a class, StringBuilder, so that these operations can be done with a temporary, efficient, mutable data structure that will emit the final string when complete.

    java.util.Collection<T> collection;

    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (T element : collection) {
            sb.append(element);
        }
        return sb.toString();
    }

In JavaScript a popular string building idiom is to use an array. Push the components onto an array and join() the result.

    function toString(object) {
        var output = [];
        for (var k in object) {
            output.push(k);
            output.push(' -> ');
            output.push(object[k]);
            output.push('\n');
        }
        return output.join('');
    }

    toString({a: 1, b: 2});
    // => "a -> 1\nb -> 2\n"

Emacs Lisp

What character sequence data structure already exists in Elisp that’s efficient at insertion, update, and deletion? Buffers, of course! I know it’s easy to forget, but editing sequences of characters is the primary purpose of Emacs, after all. To make use of a buffer as a string builder, use one of my favorite macros: with-temp-buffer. I like to combine this with setting standard-output so that all of the printing functions go there.

    (require 'cl-lib)

    ;; Each PAIR is a two-element list such as (a 1), not a dotted pair.
    (defun to-string (alist)
      (with-temp-buffer
        (let ((standard-output (current-buffer)))
          (dolist (pair alist)
            (princ (cl-first pair))
            (princ " -> ")
            (princ (cl-second pair))
            (princ "\n")))
        (buffer-string)))

Update: Jon O. pointed out that Emacs has a with-output-to-string macro available to do this more concisely.
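As a rough sketch of what that looks like, here is the same function built on with-output-to-string, which binds standard-output to a scratch buffer and returns whatever was printed there. The name to-string-concise is made up for this example, and it assumes the same list-of-two-element-lists input as to-string.

```elisp
(require 'cl-lib)

(defun to-string-concise (alist)
  "Format ALIST, a list of two-element lists, as KEY -> VALUE lines."
  ;; `with-output-to-string' collects everything the body prints to
  ;; `standard-output' and returns it as a string.
  (with-output-to-string
    (dolist (pair alist)
      (princ (cl-first pair))
      (princ " -> ")
      (princ (cl-second pair))
      (princ "\n"))))

(to-string-concise '((a 1) (b 2)))
;; => "a -> 1\nb -> 2\n"
```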

Internally Elisp buffers are gap buffers, a rather simple data structure where the data is split into two sequences with a “gap” in between. Insertion and deletion occur at the gap, which slides up and down the overall sequence. This makes gap buffers efficient for making lots of edits localized in a single area, just as a human would do while editing text.
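To make the idea concrete, here is a toy gap buffer written in Elisp itself. It is only an illustration of the data structure, not how Emacs implements buffers in C: the capacity is fixed, positions are zero-based and assumed to be within the text, and every name starting with toy-gap is invented for this sketch.

```elisp
(require 'cl-lib)

(cl-defstruct (toy-gap (:constructor toy-gap--create))
  data        ; vector of characters, including the unused gap cells
  gap-start   ; index of the first gap cell
  gap-end)    ; index just past the last gap cell

(defun toy-gap-make (capacity)
  "Return an empty toy gap buffer with room for CAPACITY characters."
  (toy-gap--create :data (make-vector capacity 0)
                   :gap-start 0
                   :gap-end capacity))

(defun toy-gap-move (gb pos)
  "Slide the gap in GB so that it starts at character position POS."
  (let ((data (toy-gap-data gb)))
    ;; Moving left: carry characters from before the gap to after it.
    (while (> (toy-gap-gap-start gb) pos)
      (cl-decf (toy-gap-gap-start gb))
      (cl-decf (toy-gap-gap-end gb))
      (aset data (toy-gap-gap-end gb) (aref data (toy-gap-gap-start gb))))
    ;; Moving right: carry characters from after the gap to before it.
    (while (< (toy-gap-gap-start gb) pos)
      (aset data (toy-gap-gap-start gb) (aref data (toy-gap-gap-end gb)))
      (cl-incf (toy-gap-gap-start gb))
      (cl-incf (toy-gap-gap-end gb)))))

(defun toy-gap-insert (gb pos char)
  "Insert CHAR at position POS in GB.  Signal an error when GB is full."
  (when (= (toy-gap-gap-start gb) (toy-gap-gap-end gb))
    (error "Toy gap buffer is full"))
  (toy-gap-move gb pos)
  (aset (toy-gap-data gb) (toy-gap-gap-start gb) char)
  (cl-incf (toy-gap-gap-start gb)))

(defun toy-gap-delete (gb pos)
  "Delete the character at position POS in GB by widening the gap over it."
  (toy-gap-move gb pos)
  (when (< (toy-gap-gap-end gb) (length (toy-gap-data gb)))
    (cl-incf (toy-gap-gap-end gb))))

(defun toy-gap-string (gb)
  "Return the contents of GB as a string, skipping the gap."
  (let ((data (toy-gap-data gb)))
    (concat (cl-subseq data 0 (toy-gap-gap-start gb))
            (cl-subseq data (toy-gap-gap-end gb)))))

;; Type "heo", then go back and insert two characters at position 2.
(let ((gb (toy-gap-make 8)))
  (toy-gap-insert gb 0 ?h)
  (toy-gap-insert gb 1 ?e)
  (toy-gap-insert gb 2 ?o)
  (toy-gap-insert gb 2 ?l)
  (toy-gap-insert gb 2 ?l)
  (toy-gap-string gb))
;; => "hello"
```

Consecutive edits at the same spot touch only the cell at the edge of the gap. The cost of an edit is the distance the gap has to travel first, which is why scattered edits are the slow case.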

Each character in a buffer is a full Unicode code point and can have an arbitrary set of properties associated with it (font-lock-face, read-only, nonstickiness, etc.). Along with inline image objects, this makes buffers rich enough to display rendered HTML (to a limited extent).
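A small sketch of per-character properties, using the standard propertize and get-text-property functions. The function name sample-face-properties is invented for this example.

```elisp
(defun sample-face-properties ()
  "Return the `face' property at positions 1 and 7 of a sample buffer."
  (with-temp-buffer
    ;; "plain " is inserted bare; "bold" carries a `face' text property.
    (insert "plain " (propertize "bold" 'face 'bold))
    ;; Buffer positions are 1-based: 1 is the "p", 7 is the "b" of "bold".
    (list (get-text-property 1 'face)
          (get-text-property 7 'face))))

(sample-face-properties)
;; => (nil bold)
```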

The Catch

There’s an important caveat to using buffers as mutable strings: live buffers are not reclaimed by the garbage collector. Each buffer goes into the global buffer list, implemented internally as an intrusive linked list, and stays there until it is explicitly killed. If a buffer is not on this list, it’s a dead buffer.

Ultimately this makes buffer objects poor return values. It’s an impedance mismatch. The caller has to be careful to free (“kill”) the buffer. It’s easy to miss if an error is signaled. For example, url-retrieve and url-retrieve-synchronously return a buffer with the response from a web server. It’s not uncommon for Elisp programs to leak these buffers during normal operation.

    (with-current-buffer (url-retrieve-synchronously some-url)
      ;; Skip past the HTTP headers to the response body.
      (goto-char url-http-end-of-headers)
      (prog1 (json-read)
        (kill-buffer)))

If json-read fails, the buffer is leaked.
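One way to close that hole without any extra packages is unwind-protect, which runs its cleanup forms whether the body returns normally or signals an error. The following is a minimal sketch. The names call-with-buffer-cleanup and fetch-json are hypothetical, not part of url or json, and fetch-json assumes the response body is valid JSON text.

```elisp
(require 'json)
(require 'url)
(require 'url-http)

(defun call-with-buffer-cleanup (buffer function)
  "Call FUNCTION with BUFFER current, then kill BUFFER, even on error."
  (unwind-protect
      (with-current-buffer buffer
        (funcall function))
    ;; Guard against nil: (kill-buffer nil) would kill the current buffer.
    (when (buffer-live-p buffer)
      (kill-buffer buffer))))

(defun fetch-json (some-url)
  "Fetch SOME-URL and parse the response body as JSON."
  (call-with-buffer-cleanup
   (url-retrieve-synchronously some-url)
   (lambda ()
     ;; Skip past the HTTP headers to the response body.
     (goto-char url-http-end-of-headers)
     (json-read))))
```

The caller still has to remember to do this at every call site, which is the underlying problem with returning buffers in the first place.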

As a side note: alternatively you could use my finalize package to associate the buffer with an object that is subject to garbage collection. The buffer will be killed immediately when the object is garbage collected.

Buffer Passing Style

To deal with this, my preferred idiom is what I call buffer-passing style. Rather than have the callee instantiate the buffer, the caller instantiates the buffer and “passes” it implicitly as the current buffer. The callee fills it with something. The caller should use something like with-temp-buffer so that the buffer has a clean life-cycle, fully managed by the caller.

Imagine that url-retrieve-synchronously, instead of returning a buffer, put the response in the current buffer. If anything goes wrong, the buffer will be automatically killed by with-temp-buffer.

    (with-temp-buffer
      (url-retrieve-synchronously some-url)
      (goto-char url-http-end-of-headers)
      (json-read))
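The same shape works for any producer and consumer, not just HTTP. As a rough sketch with made-up names (insert-report as the callee, report-line-count as the caller), assuming each entry is a two-element list:

```elisp
(defun insert-report (alist)
  "Callee: write ALIST into the current buffer as KEY -> VALUE lines."
  (dolist (pair alist)
    (insert (format "%s -> %s\n" (car pair) (cadr pair)))))

(defun report-line-count (alist)
  "Caller: own the buffer, let `insert-report' fill it, then inspect it."
  (with-temp-buffer
    ;; The callee never sees a buffer object; it just writes to the
    ;; current buffer, which is killed when this form exits.
    (insert-report alist)
    (count-lines (point-min) (point-max))))

(report-line-count '((a 1) (b 2)))
;; => 2
```

If insert-report signals an error partway through, with-temp-buffer still kills the buffer, so nothing leaks and the callee needs no cleanup code of its own.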

Buffer-passing style is what I settled on for simple-httpd. Servlets are called with the output buffer as the current buffer and with standard-output set to this buffer. The servlet is only responsible for filling this buffer with content. Thanks to process-send-region, the content is never actually copied into a string.

    (defservlet* search :application/json (q)
      (princ (json-encode (search-results q))))

I didn’t recognize buffer-passing style until much later. As a result, far too much of simple-httpd is still string-oriented when it shouldn’t be.