compiled lisp file format (Re: Skipping unexec via a big .elc file)

From: Ken Raeburn
Subject: compiled lisp file format (Re: Skipping unexec via a big .elc file)
Date: Sun, 21 May 2017 04:44:01 -0400

I haven’t had much time to further the work on the big-elc approach recently, but there is one idea I want to toss out there for possibly improving the load time further: changing the .elc file format to a binary one. I’m not talking about a memory image like Daniel is working on; I mean a file representing a sequence of S-expressions, but optimized for loading speed rather than for human readability.

The Guile project has taken this idea pretty far; they’re generating ELF object files with a few special sections for Guile objects, using the standard DWARF sections for debug information, etc. While that has a certain appeal (making C modules and Lisp files look much more similar, maybe being able to link Lisp and C together into one executable image, letting GDB understand some of your data), switching to a machine-specific format would be a pretty drastic change, when we can currently share the files across machines.

I haven’t got a complete, concrete proposal, but I see at least two general approaches:

1) Follow the model of flat object file formats: some file sections hold data of various types (string content, symbol names, integer or floating-point constants); others (the equivalent of standard object file “relocation” data) provide the information needed to allocate and fill in the desired container objects (pairs, vectors, etc.), with references to the symbols, strings, or other container objects.

2) Continue to use the current recursive processing, but with a binary format. Some (byte? word?) value indicates “this is string data”; it’s followed by a byte count and that many bytes of string content (always in the Emacs internal encoding, so we don’t have to translate when reading). Another value indicates an integer constant. Another indicates a vector, and is followed by a length and then that many other values, each processed recursively before we get back to the object following the vector.
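To make approach 2 concrete, here is a minimal sketch in Python (Emacs itself would do this in C, in lread.c). The tag values, field widths, and use of UTF-8 as a stand-in for the Emacs internal encoding are all my own invented assumptions, not a real proposal; the point is just the shape: a tag byte, a length, then raw bytes copied in one bulk read instead of a character-at-a-time loop.

```python
import io
import struct

# Invented tag values for illustration only.
TAG_INT = 0x01     # followed by an 8-byte signed integer
TAG_STR = 0x02     # followed by a 4-byte byte count, then that many raw bytes
TAG_VEC = 0x03     # followed by a 4-byte element count, then the elements

def dump(obj, out):
    """Recursively write one object in the tagged binary format."""
    if isinstance(obj, int):
        out.write(bytes([TAG_INT]))
        out.write(struct.pack("<q", obj))
    elif isinstance(obj, str):
        data = obj.encode("utf-8")        # stand-in for the Emacs internal coding
        out.write(bytes([TAG_STR]))
        out.write(struct.pack("<I", len(data)))
        out.write(data)                   # one bulk write; no escape processing
    elif isinstance(obj, list):           # stand-in for a Lisp vector
        out.write(bytes([TAG_VEC]))
        out.write(struct.pack("<I", len(obj)))
        for elt in obj:
            dump(elt, out)                # elements processed recursively
    else:
        raise TypeError(f"unsupported type: {type(obj)}")

def load(inp):
    """Read one object back; string data arrives in a single bulk read."""
    tag = inp.read(1)[0]
    if tag == TAG_INT:
        return struct.unpack("<q", inp.read(8))[0]
    if tag == TAG_STR:
        (n,) = struct.unpack("<I", inp.read(4))
        return inp.read(n).decode("utf-8")    # fread()-style copy, no per-char loop
    if tag == TAG_VEC:
        (n,) = struct.unpack("<I", inp.read(4))
        return [load(inp) for _ in range(n)]
    raise ValueError(f"unknown tag: {tag}")

buf = io.BytesIO()
dump(["defvar", 42, ["a", "b"]], buf)
buf.seek(0)
print(load(buf))   # → ['defvar', 42, ['a', 'b']]
```

Note how the reader never inspects string content byte by byte; the explicit byte count up front is what lets it skip escape handling and coding-system translation entirely, which is the main claimed win over the textual reader.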
Each object’s initializer’s length depends on its type and, for container types, on the values contained within.

Either way, getting away from the expensive one-character-at-a-time processing, multibyte coding, escape processing, etc., and pushing around groups of bytes whenever possible should save us time. This would be usable not just for the dumped.elc file, but for other compiled Lisp files as well, whether in the distribution, from ELPA, or from the user’s own code.

I did throw together a half-baked attempt to try some of this out. I added a new “#” construct for unibyte strings, putting the byte count into the file so that the string data could be copied with fread() instead of a READCHAR loop. I also added a new version of the “#n#” syntax that uses a fixed number of READCHAR calls and avoids the decimal arithmetic. So the file can no longer be processed as Lisp, and it still requires some text parsing, though not nearly as much as before; some of the worst of both worlds. But in my tests, the load time for dumped.elc did drop by another 12% (start in batch mode, print a message, and exit: from 0.227s down to 0.2s or less per run, still loading a couple of standard-elc-format files during startup).

I’m curious if people think this might be an approach worth pursuing. Or if the Lisp-based elc format is seen as advantageous in ways I’m not seeing….

Ken