What's a lexer?

A lexer is a string-eating state machine that uses regular expressions to define transitions from one state to another.

Lexers appear most frequently as the very first phase of a compiler, where they're used to "tokenize" the input. They ought to appear more frequently as protocol parsers, but the blocking behavior I described earlier makes traditional lexers unsuitable for that role.

Lex is a lexer-generator tool that outputs C code. In lex, a lexer specification is a collection of rules that have the following form:

<STATE> regex { action }

In English, each rule means "if in state STATE and regex matches the longest prefix of the remaining input, match this prefix and perform action." Actions can be arbitrary code, and they have access to an API for emitting a token or changing to another lexing state.
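For example, a fragment of a purely illustrative lex specification might look like the following. The token-emitting function is made up for the sake of the example; BEGIN and yytext are part of lex's real API:

```
%x COMMENT
%%
<INITIAL>[0-9]+    { emit_int_token(yytext); }
<INITIAL>"/*"      { BEGIN(COMMENT); }
<COMMENT>"*/"      { BEGIN(INITIAL); }
<COMMENT>.|\n      { /* discard comment text */ }
```

The first rule fires on the longest run of digits; the others switch between the INITIAL and COMMENT states to skip comments.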

I have another article on lexers (in JavaScript) if you want to know more.

A Scala DSEL for regular expressions

To embed a lexing toolkit in Scala, we'll need a way to describe regular expressions. We can't use the Java regex library, because we'll need to be able to compute the derivative of a regular expression, and for that, we need access to its structure.

The heart of the DSEL is the class RegularLanguage, whose interface describes all of the regular operations:

/** A regular language is a set of strings constructable
    from union, concatenation and repetition. */
abstract class RegularLanguage {
  // Kleene star; zero or more repetitions:
  def * : RegularLanguage ;

  // Exactly n repetitions:
  def ^ (n : Int) : RegularLanguage

  // Kleene plus; one or more repetitions:
  def + : RegularLanguage ;

  // Option; zero or one instances:
  def ? : RegularLanguage ;

  // Concatenation:
  def ~ (suffix : RegularLanguage) : RegularLanguage ;

  // Union/alternation:
  def || (that : RegularLanguage) : RegularLanguage ;

  // For computing derivatives:
  def derive (c : Char) : RegularLanguage ;
  def deriveEND : RegularLanguage ;

  // Properties:
  def acceptsEmptyString : Boolean ;
  def rejectsAll : Boolean ;
  def isEmptyString : Boolean ;
}

// Useful implicits for regular expressions:
object RegularLanguageImplicits {
  implicit def charToRegEx (c : Char) : RegularLanguage ;
  implicit def stringToRegEx (s : String) : RegularLanguage ;

  trait CharRangeable {
    def thru (end : Char) : RegularLanguage ;
  }

  implicit def charToCharRangeable (start : Char) : CharRangeable ;
}

// Matches end of input:
case object END extends RegularLanguage

// Matches no strings at all:
case object EmptySet extends RegularLanguage

// Matches the length-zero string:
case object Epsilon extends RegularLanguage

// Matches any character:
case object AnyChar extends RegularLanguage
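As a quick illustration of how these operations compose, here is a sketch (assuming the implicits in RegularLanguageImplicits are in scope; the names digit, int, letter and ident are my own):

```scala
import RegularLanguageImplicits._

// A decimal digit and a signed integer literal:
val digit : RegularLanguage = '0' thru '9'
val int   : RegularLanguage = ("-"?) ~ (digit+)

// A simple identifier: a letter followed by letters or digits:
val letter = ('A' thru 'Z') || ('a' thru 'z')
val ident  = letter ~ ((letter || digit)*)
```

The implicit conversions let plain characters and strings sit directly inside larger expressions, so "-" and '0' become regular languages on demand.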

Live Streams: A non-blocking interface

Supposing we have a lexer defined as a collection of rules, what we want is a lexer encoded as an object that can consume input characters, one at a time, and perform lexing actions as soon as it gets enough input to do so. If we stop feeding the lexer input, or we don't give it enough, the lexer should return to us without hanging.

To pull this off, we'll use the abstraction of "live streams" for both the input to and the output of the lexer. A live stream is a linked list whose tail can grow dynamically. In Scala, my interface for live streams is:

class LiveStream[A] (val source : LiveStreamSource[A]) {
  def isPlugged : Boolean ;
  def isEmpty : Boolean ;
  def head : A ;
  def tail : LiveStream[A] ;
}

So, it's almost exactly a linked list. The difference is that a live stream has a source, and a particular stream node can be temporarily "plugged." A stream node is plugged if the source hasn't yet provided enough input for this location in the stream. (A plugged stream node becomes empty if the source declares the end of the input is reached.)
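A hypothetical usage sketch, using only the interfaces above, shows the life cycle of a stream node:

```scala
val source = new LiveStreamSource[Char]
val stream = new LiveStream[Char](source)

// No input yet: the head of the stream is plugged.
assert(stream.isPlugged)

source += 'a'
// Now the head is available, and the tail is the new plug:
assert(!stream.isPlugged && stream.head == 'a')
assert(stream.tail.isPlugged)

source.terminate()
// The end of input was declared, so the plugged tail becomes empty:
assert(stream.tail.isEmpty)
```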

The interface for a stream source is a wrapper around a queue:

class LiveStreamSource[A] {
  def isTerminated : Boolean ;
  def isEmpty : Boolean ;
  def terminate () ;
  def addListener (listener : List[A] => Unit) ;
  def hasNext () : Boolean ;
  def next () : A ;
  def += (a : A) ;
  def ++= (seq : Iterable[A]) ;
}

A useful addition here is the addListener method. Since the lexer encodes its output as a live stream, a client of this lexer can add a listener to pick up its computation once more input is available. Internally, the non-blocking lexer adds a listener to the input source, so that it automatically makes what progress it can any time more input becomes available.
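As a sketch of the client side — the drain helper and the lexer variable below are invented for illustration — a listener can consume whatever output is available each time the source signals more input:

```scala
// Consume available elements, stopping at a plug or the end:
def drain[A](s : LiveStream[A])(handle : A => Unit) : LiveStream[A] =
  if (s.isPlugged || s.isEmpty) s
  else { handle(s.head) ; drain(s.tail)(handle) }

var rest = lexer.output
lexer.output.source.addListener { _ =>
  rest = drain(rest)(token => println("token: " + token))
}
```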

A Scala DSEL for non-blocking lexers

To create a non-blocking lexer in Scala, you have to create a new class that inherits from NonblockingLexer, whose interface is sketched below:

// Eats C's, emits A's.
abstract class NonblockingLexer[C <% Char, A] {

  // Attaches this lexer to a stream:
  def lex (input : LiveStream[C]) ;

  // Contains the emitted A's:
  def output : LiveStream[A]

  // Internal lexing states:
  abstract class LexerState ;

  // User-definable states:
  class MajorLexerState extends LexerState {
    trait Matchable {
      def apply (action : => Unit) ;
      def over (action : List[C] => Unit) ;
    }
    trait Switchable {
      def to (action : => LexerState) ;
      def over (action : List[C] => LexerState) ;
    }
    // Methods for adding rules to this state:
    def apply (regex : RegularLanguage) : Matchable ;
    def switchesOn (regex : RegularLanguage) : Switchable ;
  }

  class StatefulMajorLexerState[S] (private var state : S) extends MajorLexerState {
    // Methods for adding rules to this state:
    def apply (s : S) : StatefulMajorLexerState[S]
    def update (regex : RegularLanguage, stateAction : (S, List[C]) => LexerState) ;
  }

  // Called by actions:
  protected def emit (token : A) ;

  // Create new lexing states:
  def State () : MajorLexerState ;
  def State[S] (state : S) : StatefulMajorLexerState[S]

  // Where the lexer starts:
  protected def MAIN : LexerState ;
}

So, a lexer creates states with the State methods. Then, it associates rules with them using the methods inside states. Finally, it selects one of the lexing states as the initial lexing state: MAIN.

Example: A lexer for Scheme

As an example, here's the body of a lexer that tokenizes Scheme:

// Abbreviations:
val ch  = "#\\" ~ AnyChar
val id  = (('A' thru 'Z') || ('a' thru 'z') || ('0' thru '9') ||
           oneOf("-+/*_?%$#&^=!@<>:"))+
val int = ("-"?) ~ ('0' thru '9')+
val ws  = oneOf(" \r\t\n")+
val com = ";" ~ ((!oneOf("\r\n"))*)

// States:
val MAIN        = State()
val BANGCOMMENT = State(0)
val STRING      = State[List[Char]](List())

// Rules:

// State-switching rules:
MAIN switchesOn ("#!") to { BANGCOMMENT(1) }
MAIN switchesOn ("\"") to { STRING(List()) }

// Regular tokens:
MAIN (",@") { emit(PunctToken(",@")) }
MAIN (",")  { emit(PunctToken(",")) }
MAIN ("`")  { emit(PunctToken("`")) }
MAIN ("'")  { emit(PunctToken("'")) }
MAIN ("#(") { emit(PunctToken("#(")) }
MAIN ("(")  { emit(PunctToken("(")) }
MAIN (")")  { emit(PunctToken(")")) }
MAIN ("[")  { emit(PunctToken("[")) }
MAIN ("]")  { emit(PunctToken("]")) }
MAIN ("{")  { emit(PunctToken("{")) }
MAIN ("}")  { emit(PunctToken("}")) }
MAIN (".")  { emit(PunctToken(".")) }
MAIN ("#t") { emit(BooleanToken(true)) }
MAIN ("#f") { emit(BooleanToken(false)) }
MAIN (END)  { terminate() }
MAIN (ws)   { }
MAIN (com)  { }
MAIN (ch)  over { chars => emit(CharToken(chars(2))) }
MAIN (int) over { chars => emit(IntToken(chars)) }
MAIN (id)  over { chars => emit(SymbolToken(chars)) }

// Nested #! ... !# comments:
BANGCOMMENT ("#!")    = { (n, chars) => BANGCOMMENT(n+1) }
BANGCOMMENT (AnyChar)   { }
BANGCOMMENT ("!#")    = { case (1, chars) => MAIN
                          case (n, chars) => BANGCOMMENT(n-1) }

// Strings:
STRING ("\"")    = { (string, _) =>
                     { emit(StringToken(string.reverse.mkString)) ; MAIN } }
STRING ("\\\"")  = { (string, chars) => STRING('"' :: string) }
STRING ("\\\\")  = { (string, chars) => STRING('\\' :: string) }
STRING (AnyChar) = { (string, chars) => STRING(chars.reverse ++ string) }

Example: A lexer for HTTP requests

Here's the body of the lexer that parses an incoming HTTP request in my (newly enhanced) non-blocking web server:

// Abbreviations:
val nl = "\r\n" || "\n"
val headerLine = ((!oneOf("\r\n:"))+) ~ ": " ~ ((!oneOf("\r\n"))+) ~ nl

// Lexing states:
val METHOD   = State()
val RESOURCE = State()
val VERSION  = State()
val HEADER   = State()
val DATA     = State()
val DONE     = State()

// Initial state:
protected val MAIN = METHOD

// Find the method, GET or POST:
METHOD switchesOn ("GET ")  to { method = GET  ; RESOURCE }
METHOD switchesOn ("POST ") to { method = POST ; RESOURCE }

// Read in the resource:
RESOURCE switchesOn ((!oneOf(" "))+) over { chars =>
  { resource = chars ; VERSION } }

// Grab (and ignore) the version:
VERSION switchesOn (" HTTP/1.1" ~ nl) to { HEADER }
VERSION switchesOn (" HTTP/1.0" ~ nl) to { HEADER }

// Parse a header line:
HEADER (headerLine) over { chars =>
  val line : String = chars
  val parts = line.split(": ")
  val header = parts(0)
  val value = parts(1).trim
  headers = (headers(header) = value)

  // Watch for significant headers:
  header match {
    case "Content-Length" => {
      contentLength = Integer.parseInt(value)

      // The client intends to send data.
      // Extend the lexer to read the input data:
      DATA switchesOn (AnyChar ^ contentLength) over { chars => {
        // Emit the request now:
        emit(new Request(conn, method, resource, headers, data))
        this.data = chars
        DATA.reset()
        DONE
      } }
    }
    case _ => {}
  }
}

// Blank line means end of headers:
HEADER switchesOn (nl) to {
  if (contentLength > 0)
    DATA
  else {
    // Emit this request:
    emit(new Request(conn, method, resource, headers, data))
    DONE
  }
}

DONE (END) { /* do nothing */ }

Compared with the previous request-parsing architecture, I'm now in a much better place to implement features like Keep-Alive connections. The old code was an ad hoc implementation of a finite state machine.

Under the hood: Regex derivatives

A traditional lexer-generator like lex converts all of the regular expressions into NFAs and compiles them into a special deterministic finite state machine. It then generates a C program that acts as the lexer.

This approach is too heavyweight and inflexible for modern lexing needs. Aside from the fact that lex produces blocking, non-thread-safe lexers, modern lexers might need to modify themselves on the fly. See how my HTTP lexer handles client data, for instance. Or, consider that some programming languages, even C, require the lexer to change its behavior as the program is parsed. Or, just imagine what kind of a programming language you could create with a lexer that can modify itself.

So, what we need is a way of amortizing the cost of what would have been lexer-generation across the lexing process itself. This means that NFA conversion, and especially DFA compilation, are out. We're going to have to work with the regular expressions themselves as the intermediate state of the lexer. This is where the derivative of a regular expression comes in.

I've already written a detailed explanation of these, so I won't go into much detail here. I'll just restate the definition of the derivative:

The derivative of a regular expression with respect to a character is a new regular expression that matches the tails of all strings beginning with that character and which match the original regular expression.

Confused? Don't be. It's a simple concept once you see some examples.

Most implementations of the method derive are one-liners!
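For instance, the derivative of foo with respect to 'f' is oo, and the derivative of a* with respect to 'a' is again a*. Here is a sketch of derive in the classic Brzozowski style — the case classes are stand-ins for the toolkit's internal representation, which isn't shown in this article:

```scala
case class Character(c : Char) extends RegularLanguage
case class Union(l : RegularLanguage, r : RegularLanguage) extends RegularLanguage
case class Catenation(l : RegularLanguage, r : RegularLanguage) extends RegularLanguage
case class Star(l : RegularLanguage) extends RegularLanguage

def derive(lang : RegularLanguage, a : Char) : RegularLanguage = lang match {
  case EmptySet     => EmptySet                           // D_a(∅) = ∅
  case Epsilon      => EmptySet                           // D_a(ε) = ∅
  case AnyChar      => Epsilon                            // D_a(.) = ε
  case Character(c) => if (c == a) Epsilon else EmptySet  // D_a(a) = ε ; D_a(b) = ∅
  case Union(l, r)  => Union(derive(l, a), derive(r, a))
  case Star(l)      => Catenation(derive(l, a), Star(l))  // D_a(L*) = D_a(L) L*
  case Catenation(l, r) =>
    // If l can match the empty string, the character may also begin a match of r:
    if (l.acceptsEmptyString)
      Union(Catenation(derive(l, a), r), derive(r, a))
    else
      Catenation(derive(l, a), r)
}
```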

In short, every time we consume a character, we're going to take the derivative of all of the rules in the current state with respect to that character. The result is the new lexing state, and a quick examination of the new rules tells us if we've found a longest match and need to fire an action. Even without optimizations or caching, this approach is fast.

Code