Performance boost for Hoa\Compiler

Written on 24 August 2016.

A lot of work has been done to boost the performance of Hoa\Compiler, mainly by reducing the amount of memory required to lexically and syntactically analyse a datum. Up to 80% of the memory can be saved with a large datum. A new feature also allows the in-memory parser to be compiled into a PHP class, which can save up to 45% of the CPU time spent on parser initialisation.

Constant memory usage in the lexer

First of all, one of the biggest improvements lands in the lexer (aka the lexical analyser). The lexer is responsible for cutting a datum into a sequence of tokens (aka lexemes). Until today, it read the entire datum and produced the entire sequence at once. (Learn more by reading its documentation.)

The issue with this approach is that if the datum is 1 MB long, then the memory will contain 1 MB (the original datum) plus at least 1 MB (the sequence). Moreover, the sequence is represented as a PHP array, so it has an overhead of some bytes per token, in addition to the metadata attached to each token (the name, the namespace, the length, the offset, etc.). The resulting overhead depends on the PHP version in use, but in any case this is not optimal.

Thus, a straightforward optimisation is to transform the lexer into an iterator. Actually, it has been transformed into a generator. So each call to the lexer will produce the next token, without stacking all the tokens in memory.

With this new approach, the API stays identical and the memory peak usage is drastically reduced.
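The difference between the two approaches can be sketched in a few lines of plain PHP. This is only an illustration under assumptions: the token table, the lex() and lexAll() functions and the token array shape are hypothetical and are not the Hoa\Compiler lexer.

```php
<?php

// Hypothetical token definitions: name => PCRE pattern (not Hoa's own format).
const TOKENS = [
    'number' => '\d+',
    'comma'  => ',',
    'space'  => '\s+',
];

// Lazy variant: a generator that yields one token at a time.
function lex(string $datum): Generator
{
    $offset = 0;
    $length = strlen($datum);

    while ($offset < $length) {
        foreach (TOKENS as $name => $pattern) {
            // The A modifier anchors the match at $offset.
            if (1 === preg_match('#' . $pattern . '#A', $datum, $match, 0, $offset)) {
                $value = $match[0];

                // Spaces are skipped, every other token is handed to the caller.
                if ('space' !== $name) {
                    yield ['name' => $name, 'value' => $value, 'offset' => $offset];
                }

                $offset += strlen($value);

                continue 2; // Go back to the while loop for the next token.
            }
        }

        throw new RuntimeException('Unrecognised token at offset ' . $offset);
    }
}

// Eager variant: stacks the whole sequence in an array before returning it.
function lexAll(string $datum): array
{
    return iterator_to_array(lex($datum), false);
}

// Only one token lives in memory at a time in this loop.
foreach (lex('1, 22,333') as $token) {
    echo $token['name'], ' at offset ', $token['offset'], "\n";
}
```

With lexAll(), the memory grows with the number of tokens. With lex(), the consumer holds a single token per iteration, which is why the peak stays close to constant.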

The graph presents three data of different sizes: small (18 bytes), medium (1091 bytes) and large (65432 bytes). These data are JSON strings. For each datum, the peak memory usage has been measured with the two approaches: collection (before) and generator (after). The memory is almost constant with the generator approach, and the more the lexer has to consume, the greater the delta is.

Save memory and CPU in the parser, and pragmas

Now, the parser benefits from the lexer being an iterator. Method calls have been reduced, so many CPU cycles have been saved. Indirections have been reduced too. When a datum points to another datum that points to the final one, we count 2 indirections to get the final datum. Each indirection has a cost, and this cost has been reduced.

Also, the lexer is wrapped inside a Hoa\Iterator\Buffer iterator. This brings a new feature. One may remember that the parser is called Hoa\Compiler\Llk\Parser, so it is supposed to be LL(k). In practice, it was LL(*). Now it is a real LL(k) parser and we can set the value of k, thanks to a new feature introduced in the grammar description language: pragmas. Indeed, by writing:

%pragma parser.lookahead 0

we obtain an LL(0) parser, for free. If the parser needs to go beyond the value of k, an exception will be thrown.

And with a buffer iterator wrapping the lexer, it is still possible to move forward and backward in the lexer without lexing tokens several times.
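To illustrate the idea of a bounded lookahead on top of a lazy token stream, here is a minimal sketch. The LookaheadBuffer class and its peek() and consume() methods are hypothetical names chosen for this example; it is not Hoa\Iterator\Buffer, and it only covers looking ahead, not moving backward.

```php
<?php

// Hypothetical helper, not Hoa\Iterator\Buffer itself.
final class LookaheadBuffer
{
    private $tokens;
    private $k;
    private $buffer = [];

    public function __construct(Iterator $tokens, int $k)
    {
        $this->tokens = $tokens;
        $this->k      = $k;
    }

    // Returns the token $distance positions ahead (0 is the current one),
    // or null when the end of the stream is reached.
    public function peek(int $distance = 0)
    {
        if ($distance > $this->k) {
            throw new OutOfBoundsException(
                'Lookahead of ' . $distance . ' exceeds k = ' . $this->k
            );
        }

        // Pull tokens from the underlying iterator only when first needed,
        // so each token is lexed exactly once.
        while (count($this->buffer) <= $distance && $this->tokens->valid()) {
            $this->buffer[] = $this->tokens->current();
            $this->tokens->next();
        }

        return $this->buffer[$distance] ?? null;
    }

    // Consumes the current token and returns it.
    public function consume()
    {
        $token = $this->peek(0);
        array_shift($this->buffer);

        return $token;
    }
}

$buffer = new LookaheadBuffer(new ArrayIterator(['a', 'b', 'c']), 1);
$first  = $buffer->peek(0); // Current token.
$second = $buffer->peek(1); // One token ahead, allowed because k = 1.
```

In this sketch, asking for peek(2) with k = 1 throws an exception, which mirrors the behaviour described above when the parser needs to go beyond k.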

Exporting the parser into PHP code

Hoa\Compiler\Llk is a compiler-compiler. This means that a grammar is compiled into a parser, and the parser is then used to build a compiler. This first compilation, from grammar to in-memory parser, does not need to happen every time. Actually, it only has to be done once, before going to production.

So far, the solution was to serialise the in-memory parser and save it into a file. Serialisation is dangerous from a security point of view. The source must be trusted. It was not a comfortable situation for our users.

Now, it is possible to save the in-memory parser as a string representing PHP code. This PHP code instantiates the same parser (in its initial state). The parser takes the form of a PHP class. Thus, one might write this PHP code into a file, commit this file and use it like any regular PHP class, with autoloaders and so on.
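The general technique, generating PHP source instead of serialising an object, can be sketched as follows. Everything here is an assumption made for illustration: the rule table stands in for a real parser, and exportAsClass() and SketchParser are hypothetical names, not the export routine of Hoa\Compiler.

```php
<?php

// Hypothetical in-memory "parser": a plain rule table standing in for the
// real parser object.
$rules = [
    'value' => ['true', 'false', 'null'],
    'pair'  => ['string', 'colon', 'value'],
];

// Builds the source of a class whose constructor rebuilds the rule table.
function exportAsClass(array $rules, string $className): string
{
    return
        'class ' . $className . "\n" .
        '{' . "\n" .
        '    public $rules;' . "\n\n" .
        '    public function __construct()' . "\n" .
        '    {' . "\n" .
        '        $this->rules = ' . var_export($rules, true) . ';' . "\n" .
        '    }' . "\n" .
        '}' . "\n";
}

// Write the generated class into a file, as one would before committing it.
$file = sys_get_temp_dir() . '/SketchParser.php';
file_put_contents($file, '<?php' . "\n\n" . exportAsClass($rules, 'SketchParser'));

// Later, in production: load the file and use it as a regular class.
require $file;
$parser = new SketchParser();
```

Unlike a serialised blob, the generated file is readable, can be reviewed in a commit, and is loaded through the usual PHP class mechanisms.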

This graph compares the two following scenarios: loading the in-memory parser from a grammar and then using it, versus loading the in-memory parser from a grammar, saving it into a file, and then using it as a regular class, with the new operator to instantiate the parser. It has been applied to three different grammars shipped in Hoa: Hoa\Json, Hoa\Math and Hoa\Ruler. The “save” scenario is almost twice as fast.

Disabling Unicode support

Another pragma has been introduced to disable Unicode support in the lexer:

%pragma lexer.unicode false

This is particularly useful when the grammar defines a binary language or defines its own Unicode support. This is the case of Hoa\Json. RFC 7159 is being fully implemented, and JSON uses the same Unicode escape format as JavaScript, with surrogate pairs. It was not possible to correctly lexically analyse a JSON string with PCRE Unicode support enabled. With this new pragma, it is possible.
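The binary language case can be observed directly with the PCRE functions of PHP. The sketch below only shows the effect of the u modifier on a subject that is not valid UTF-8; the $binary value is made up for the example, and how Hoa\Compiler applies the pragma internally is not shown.

```php
<?php

// A subject holding a byte that can never appear in valid UTF-8.
$binary = "\xff\x00\x01";

// Without the u modifier, PCRE works byte by byte and the match succeeds.
$bytes = preg_match('/^\xff/', $binary);

// With the u modifier, the subject must be valid UTF-8, so matching fails
// and preg_match returns false instead of 0 or 1.
$unicode = preg_match('/^./u', $binary);
$error   = preg_last_error();

var_dump($bytes, $unicode, PREG_BAD_UTF8_ERROR === $error);
```

A lexer built on patterns with the u modifier therefore cannot tokenise such data at all, which is the kind of situation where disabling Unicode support helps.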

Also, the JSON grammar defines an LL(0) parser thanks to the parser.lookahead pragma.

Quality with Grammar-based Testing algorithms

Integration test suites already existed in the Hoa\Compiler library, but there were no unit test suites. This uncomfortable situation is now fixed. New integration test suites have been written too. Now we have:

24 test suites,

136 test cases,

320,242 assertions.

Yes, this is not a mistake: 320,242 assertions. We obtain this number by using Grammar-based Testing algorithms, defined in the following research paper: Grammar-Based Testing using Realistic Domains in PHP. This paper has been written by the authors of the Hoa\Compiler library. You might enjoy reading it.

Finally, 2 bugs have been found and fixed.
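To give an intuition of why grammar-based testing produces so many assertions, here is a small sketch that enumerates every sentence of a toy grammar up to a bounded depth and checks each one against a parser. The grammar and the sentences() function are hypothetical, PHP's built-in json_decode stands in for the parser under test, and the algorithms of the paper are not reproduced here.

```php
<?php

// Hypothetical toy grammar, not the one shipped with Hoa\Json:
//   value ::= "true" | "null" | "[]" | "[" value "]" | "[" value "," value "]"

// Yields every sentence derivable with at most $depth levels of nesting.
function sentences(int $depth): Generator
{
    yield 'true';
    yield 'null';
    yield '[]';

    if ($depth <= 0) {
        return;
    }

    foreach (sentences($depth - 1) as $left) {
        yield '[' . $left . ']';

        foreach (sentences($depth - 1) as $right) {
            yield '[' . $left . ',' . $right . ']';
        }
    }
}

// Each generated sentence becomes one assertion against the parser under
// test (json_decode plays that role in this sketch).
$assertions = 0;

foreach (sentences(2) as $sentence) {
    json_decode($sentence);
    assert(JSON_ERROR_NONE === json_last_error(), $sentence);
    ++$assertions;
}

echo $assertions, ' assertions', "\n";
```

Even this tiny grammar yields 243 sentences at depth 2, and the number grows very quickly with the depth and the number of rules.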

Next optimisations