In the last week I’ve written a new Javascript lexer, jslex. Why I did it is one of those open source adventures that starts innocently enough.

I’m working on a Django project for a client, and it needs to be localized into their language. Django has good support for localization, providing tools for extracting strings from Python, HTML, and Javascript files. But something wasn’t right: the client reported that some of the strings were still in English. Usually this means that they made a small mistake during the translation process, and the English in the source doesn’t match the English in the message file.

But when I looked, it turned out the English was completely missing from the message file. Check the source: yup, it’s properly marked for translation. Then I remembered: parsing Javascript source files for messages is fragile. I’d encountered this before, and had simply fiddled with the Javascript source to make the problem go away. But this time, as one message was re-harvested, other messages would disappear. The problem seemed more severe than I had encountered in the past. I decided to learn more about why it was happening.

Like many open source projects, Django uses Gnu gettext to manage the message files, including using the xgettext tool to parse the source files to find strings to translate. But xgettext doesn’t support parsing Javascript. Django has a strange accommodation to deal with this: it performs a simple transformation on the Javascript source, then tells xgettext that it’s Perl.

I can only guess why Perl was chosen: because Javascript and Perl both have regex literals, which as we’ll see, play a large part in this story. But Django’s Javascript-to-Perl transformation is simplistic: it just converts all //-comments on their own line into #-comments. So this Javascript:

// My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");

gets transformed into this “Perl”:

# My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");

I assume the reason //-comments that share a line with code are skipped is to avoid clobbering strings with // in them, though with multi-line strings, even that is not enough to protect them.
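A sketch of that transformation in Python (my own illustration of the behavior described above, not Django's actual code):

```python
def js_to_fake_perl(source):
    """Convert //-comments on their own line to #-comments, as described
    above.  Comments sharing a line with code are left untouched."""
    lines = []
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("//"):
            # Swap the leading // for a Perl-style # comment marker.
            indent = line[:len(line) - len(stripped)]
            line = indent + "#" + stripped[2:]
        lines.append(line)
    return "\n".join(lines)
```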

Of course, this transformation is insufficient to properly carry the strings into the “Perl” so that xgettext can find them. For example, in the above sample, the Javascript comment on line 2 is still executable Perl code after the transformation, and the apostrophe in the comment is considered the start of a string literal, so the gettext call is skipped as part of a multi-line string.

In fact, depending on the version of gettext, which determines how advanced its Perl parsing is, all sorts of innocuous Javascript constructs can throw off the parser:

gettext("Message on 1");
var x = y;
gettext("Message on 3");
gettext("Message on 4");
gettext("Message on 5");

Here messages 1 and 5 are found, and 3 and 4 are not. How come? Because Perl’s y operator consumes two strings delimited by the next character, in this case a semicolon, so lines 3 and 4 are considered literals rather than code.

In truth, Django’s accommodation for Javascript is an egregious hack. So I wanted to find a better solution. I figured that if I could properly lex Javascript, then I could manipulate the token stream to create something that could reliably be parsed by gettext.

The result is jslex, a pure-Python lexer for Javascript. Lexing Javascript turns out to be tricky due to our old friend the regex literal. When a slash character is found, it could mean one of four things: a division operator (either / or /=), a line comment (//), a multi-line comment (/*), or a regex literal. The two comment forms are simple to deal with, because a regex literal can’t be empty, so // is always a comment, and a regex can’t start with a star, so /* is always a comment.

But distinguishing between division and regexes is impossible to do at a purely lexical level, and can be quite subtle:

for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}
for (var x = a in foo && "</x>" || mot ? z/x:3;x<5;y</g/i) {xyz(x++);}

The first line has a regex of /x:3;x<5;y</g, the second has /g/i.

The ECMAScript standard says you need to parse the code: if you’re at a point where a regex literal would be a valid next token, lex the slash as a regex, but if you’re at a point where a division would be valid, lex it as division.

I wasn’t willing to write a full parser, so I’ve taken an approach similar to other lightweight Javascript tools: use the previous token to decide whether the next token can be a division or a regex. It seems to work well.

The lexer is a general-purpose multi-state lexer built on regular expressions. The rules create a two-state lexer, with one state for “division possible” and one for “regex possible.” When I thought I had it working, I outsourced the QA to Stack Overflow, finally finding something to do with my surplus of reputation points: paying a bounty to find Javascript it didn’t lex properly. Mind-twistingly, a respondent there found a useful test: a Javascript lexer written in Javascript, which, when fed through my lexer, failed because my regex-matching regex couldn’t properly lex his regex-matching regex!
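To make the idea concrete, here is a toy two-state lexer of my own (a much-simplified sketch, not jslex’s actual code), covering just enough of Javascript to handle the tricky examples above:

```python
import re

# A regex literal: escapes, character classes, then optional flags.
REGEX_LIT = re.compile(r"/(?:\\.|\[(?:\\.|[^\]\\])*\]|[^/\\\n])+/[a-zA-Z]*")

TOKEN = re.compile(r"""
      (?P<ws>\s+)
    | (?P<comment>//[^\n]*|/\*.*?\*/)
    | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
    | (?P<number>\d+(?:\.\d*)?)
    | (?P<id>[A-Za-z_$][A-Za-z0-9_$]*)
    | (?P<punct>\+\+|--|&&|\|\||[-+*%=<>!&|^~?:;,.(){}\[\]]|/=?)
""", re.VERBOSE | re.DOTALL)

# After these identifier-like tokens a regex is still legal.
KEYWORDS = {"return", "in", "typeof", "instanceof", "new", "delete",
            "void", "do", "else", "case", "throw"}

def lex(js):
    tokens, pos, state = [], 0, "regex"   # a regex is legal at the start
    while pos < len(js):
        # In the "regex possible" state, try a slash as a regex literal
        # (// and /* are always comments, so skip those).
        if state == "regex" and js[pos] == "/" and \
                not js.startswith(("//", "/*"), pos):
            m = REGEX_LIT.match(js, pos)
            if m:
                tokens.append(("regex", m.group()))
                pos, state = m.end(), "div"
                continue
        m = TOKEN.match(js, pos)
        if not m:
            raise ValueError("can't lex: %r" % js[pos:pos + 10])
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("ws", "comment"):
            continue
        tokens.append((kind, text))
        # The previous-token heuristic: after a value-like token, a slash
        # must be division; after anything else, a regex is possible.
        if kind in ("number", "string") or text in (")", "]") or \
                (kind == "id" and text not in KEYWORDS):
            state = "div"
        else:
            state = "regex"
    return tokens
```

The state transitions encode the previous-token heuristic: after an identifier, number, string, or closing bracket a slash is division; after anything else, including keywords like in and return, it can start a regex.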

To bridge Javascript code to xgettext, I chose to transform it into “C” instead of Perl. That means getting rid of the regex literals by turning them all into the C string “REGEX”, and changing single-quoted strings into double-quoted strings.
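In outline, the token-stream rewriting might look like this (my own sketch; the function name and token format here are hypothetical, not the actual patch’s API):

```python
def js_tokens_to_c(tokens):
    """Given (kind, text) pairs from a Javascript lexer, emit "C" source
    that xgettext can parse."""
    out = []
    for kind, text in tokens:
        if kind == "regex":
            # Regex literals mean nothing to xgettext; flatten them all
            # into one harmless C string.
            out.append('"REGEX"')
        elif kind == "string" and text.startswith("'"):
            # Re-quote single-quoted strings as double-quoted C strings.
            body = text[1:-1].replace("\\'", "'").replace('"', '\\"')
            out.append('"' + body + '"')
        else:
            out.append(text)
    return " ".join(out)
```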

The next phase is to determine whether this gets into Django or not. I’ve prepared it as a patch, but there was already some momentum to replace gettext with Babel, and it’s looking like it might all have to wait for 1.4 in any case. As someone who’s recently lost time to this bug, I would really rather get something into 1.3.1, so we’ll see where that ends up.

In any case, if you have need for lexing Javascript in Python, use jslex: it works.