Language parsing has always been very interesting to me. There are lots of different ways to parse languages. Today we will implement a simple lexer.

Now, strictly speaking, a lexer is a program that analyzes a string or sentence, producing a list of tokens that represents the contents of the string. Often, a lexer generates the tokens and a parser then consumes them, building a normalized AST of the code or sentence being parsed.

The Go standard library provides an excellent example of a lexer to work from in its text/template package. The code is straightforward and fairly easy to understand. We can see that it works under the following premise:

- It starts by lexing plain text until it reaches an opening "{{"
- It lexes inside the action with various functions to determine keywords and expressions
- It lexes the closing "}}" and returns to lexing text until EOF
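To get a feel for that loop, here is a rough sketch of the idea — not the actual text/template code, just the alternating text/action states compressed into one function using strings.HasPrefix and strings.Index:

```go
package main

import (
	"fmt"
	"strings"
)

// lexTemplate is a simplified sketch of the text/template approach:
// scan plain text until "{{", emit it, then scan the action until
// "}}", and repeat until the input is exhausted.
func lexTemplate(input string) []string {
	var tokens []string
	for len(input) > 0 {
		if strings.HasPrefix(input, "{{") {
			end := strings.Index(input, "}}")
			if end < 0 {
				tokens = append(tokens, "ERROR: unclosed action")
				return tokens
			}
			tokens = append(tokens, "ACTION("+input[2:end]+")")
			input = input[end+2:]
			continue
		}
		// Everything up to the next "{{" (or EOF) is plain text.
		next := strings.Index(input, "{{")
		if next < 0 {
			next = len(input)
		}
		tokens = append(tokens, "TEXT("+input[:next]+")")
		input = input[next:]
	}
	return tokens
}

func main() {
	fmt.Println(lexTemplate("Hello, {{name}}!"))
	// [TEXT(Hello, ) ACTION(name) TEXT(!)]
}
```

The real package uses separate state functions for each of those phases rather than one loop, which is the structure we will borrow below.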

Make a simple lexer that can parse an English sentence

At its very core, a sentence in English has: whitespace, words, and punctuation. Obviously, a sentence is much more complex than that, with words having a type and different punctuations causing different inflections and meanings. However, we will simplify the problem initially to keep the example concise.

First, let's define our tokens:

```go
type tokenType int

const (
	tokenWord tokenType = iota
	tokenWhitespace
	tokenPunctuation
	tokenEof
)

var names = map[tokenType]string{
	tokenWord:        "WORD",
	tokenWhitespace:  "SPACE",
	tokenPunctuation: "PUNCTUATION",
	tokenEof:         "EOF",
}
```

We have defined four tokens: a token for words, a token for whitespace, a token for punctuation, and an "EOF" token signifying the end of the input.

Now, since this is a lexer, we should have a few regular expressions:

```go
var (
	wordRegexp        = regexp.MustCompile(`[A-Za-z]+`)
	whitespaceRegexp  = regexp.MustCompile(`[\s]+`)
	punctuationRegexp = regexp.MustCompile(`[\p{P}\p{S}]+`)
)
```

These are pretty straightforward regular expressions: the first matches a run of letters, the second a run of whitespace (including newlines and tabs), and the last a run of punctuation or symbol characters.
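We can check how these behave on fragments of our test sentence. FindString returns the leftmost match, which is what our lexer will rely on when matching from its current position:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	wordRegexp        = regexp.MustCompile(`[A-Za-z]+`)
	whitespaceRegexp  = regexp.MustCompile(`[\s]+`)
	punctuationRegexp = regexp.MustCompile(`[\p{P}\p{S}]+`)
)

func main() {
	// The word regex stops at the first non-letter.
	fmt.Println(wordRegexp.FindString("test-aculous")) // test
	// The punctuation regex greedily consumes a whole run.
	fmt.Println(punctuationRegexp.FindString("..., okay")) // ...,
	// Tabs and newlines count as whitespace.
	fmt.Println(whitespaceRegexp.MatchString("\t\n")) // true
}
```

One caveat worth knowing: FindString will happily match later in the string if nothing matches at position zero, so the lexer below only calls each regex after it has already peeked at the first character and picked the right state.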

Next, we will define our token structure and a String() method, which pretty-prints the token when we pass it to fmt.Println:

```go
type token struct {
	value     string
	pos       int
	tokenType tokenType
}

func (tok token) String() string {
	return fmt.Sprintf("{%s '%s' %d}", names[tok.tokenType], tok.value, tok.pos)
}
```

Now that we have our tokens and regexes defined, let's start on the lexer itself.

```go
type stateFn func(*lexer) stateFn

type lexer struct {
	start  int // The position of the last emission
	pos    int // The current position of the lexer
	input  string
	tokens chan token
	state  stateFn
}

func (l *lexer) next() (val string) {
	if l.pos >= len(l.input) {
		l.pos++
		return ""
	}
	val = l.input[l.pos : l.pos+1]
	l.pos++
	return
}

func (l *lexer) backup() {
	l.pos--
}

func (l *lexer) peek() (val string) {
	val = l.next()
	l.backup()
	return
}

func (l *lexer) emit(t tokenType) {
	val := l.input[l.start:l.pos]
	tok := token{val, l.start, t}
	l.tokens <- tok
	l.start = l.pos
}
```

As you can see, we've defined a named function type called stateFn. These stateFns are what actually perform the lexing: each stateFn does some work and then returns the next stateFn, or nil if it has reached the end of input.
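The pattern is easier to see stripped of any lexing at all. Here is a minimal, self-contained sketch (the machine and state names are made up purely for illustration): each state records a step and hands back the next state, and the driver loops until it gets nil.

```go
package main

import "fmt"

type machine struct{ steps []string }

// stateFn-style functions: each does some work and returns the next
// state, or nil to stop.
type stateFn func(*machine) stateFn

func stateA(m *machine) stateFn {
	m.steps = append(m.steps, "A")
	return stateB
}

func stateB(m *machine) stateFn {
	m.steps = append(m.steps, "B")
	return nil // no more states: the driver loop exits
}

func main() {
	m := &machine{}
	for state := stateFn(stateA); state != nil; {
		state = state(m)
	}
	fmt.Println(m.steps) // [A B]
}
```

Our lexer's tokenize method, defined further down, is exactly this driver loop with *lexer in place of *machine.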

Then, we define our lexer with its four methods:

- next() returns the character at the current position and advances the lexer's position by one
- backup() moves the lexer's position back by one
- peek() returns the next character without moving the lexer's position
- emit() sends a token whose contents run from l.start to the current position, then advances l.start
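To see how these cursor methods interact, here is a stripped-down copy of just that logic, with emit returning the pending slice instead of sending on a channel (a simplification for demonstration only; the real lexer uses the channel so a consumer can read tokens as they are produced):

```go
package main

import "fmt"

type lexer struct {
	start int // position of the last emission
	pos   int // current position
	input string
}

func (l *lexer) next() string {
	if l.pos >= len(l.input) {
		l.pos++ // advance past EOF so a following backup() is symmetric
		return ""
	}
	val := l.input[l.pos : l.pos+1]
	l.pos++
	return val
}

func (l *lexer) backup() { l.pos-- }

func (l *lexer) peek() string {
	val := l.next()
	l.backup()
	return val
}

// emit returns everything since the last emission and resets start.
func (l *lexer) emit() string {
	val := l.input[l.start:l.pos]
	l.start = l.pos
	return val
}

func main() {
	l := &lexer{input: "Go!"}
	fmt.Println(l.peek()) // G  (position unchanged)
	l.next()
	l.next()
	fmt.Println(l.emit()) // Go (everything since the last emission)
	fmt.Println(l.peek()) // !
}
```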

Now that we have our lexer, we can create the actual lexing functions. First, let's define a word-lexing function:

```go
func lexWord(l *lexer) stateFn {
	matched := wordRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenWord)
	return lexData
}
```

We can see that this function matches the first word (as defined by our regular expression) starting at the lexer's current position. It emits that word as a token and then returns a yet-to-be-defined stateFn, lexData.

Let's define our whitespace and punctuation lexing functions:

```go
func lexPunctuation(l *lexer) stateFn {
	matched := punctuationRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenPunctuation)
	return lexData
}

func lexWhitespace(l *lexer) stateFn {
	matched := whitespaceRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenWhitespace)
	return lexData
}
```

These are both very similar to our lexWord function. Now that we have the main workers created, let's define the lexData function:

```go
func lexData(l *lexer) stateFn {
	v := l.peek()
	switch {
	case v == "":
		l.emit(tokenEof)
		return nil
	case punctuationRegexp.MatchString(v):
		return lexPunctuation
	case whitespaceRegexp.MatchString(v):
		return lexWhitespace
	}
	return lexWord
}
```

We can see that lexData is responsible for deciding which lexing function to use next. It peeks at the next character, checks it against EOF, punctuation, and whitespace in turn, and defaults to lexWord if none of the others match.

Finally, we define a tokenize method and a newLexer function to kick things off:

```go
func (l *lexer) tokenize() {
	for l.state = lexData; l.state != nil; {
		l.state = l.state(l)
	}
}

func newLexer(input string) *lexer {
	return &lexer{0, 0, input, make(chan token), nil}
}
```

We can see when we call l.tokenize() , the lexer will start with lexData and continue lexing until the current state is nil.

Now, in our main function, we can try out our lexer by first instantiating it and then sending it off in its own goroutine to complete its work:

```go
func main() {
	lex := newLexer("This is a test-aculous test, sir...")
	go lex.tokenize()
	for {
		tok := <-lex.tokens
		fmt.Println(tok)
		if tok.tokenType == tokenEof {
			break
		}
	}
}
```

View the entire source code here.

If we now try out our program, we get the following output:

```
{WORD 'This' 0}
{SPACE ' ' 4}
{WORD 'is' 5}
{SPACE ' ' 7}
{WORD 'a' 8}
{SPACE ' ' 9}
{WORD 'test' 10}
{PUNCTUATION '-' 14}
{WORD 'aculous' 15}
{SPACE ' ' 22}
{WORD 'test' 23}
{PUNCTUATION ',' 27}
{SPACE ' ' 28}
{WORD 'sir' 29}
{PUNCTUATION '...' 32}
{EOF '' 35}
```

Pretty neat! We can easily support additional tokens by defining a new token type, a regular expression, and a stateFn.

We can also use other means of matching which may be faster than regular expressions. For instance, text/template uses strings.HasPrefix to decide which lexing function to use, and then it iterates over each character rather than matching using regular expressions.
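As one possible regex-free approach (a sketch of the idea, not the actual text/template internals), the standard unicode package can classify a leading rune directly, and the lexer would then consume the run by hand:

```go
package main

import (
	"fmt"
	"unicode"
)

// classify maps a single rune to the token categories used in this
// article, without any regular expressions.
func classify(r rune) string {
	switch {
	case unicode.IsSpace(r):
		return "SPACE"
	case unicode.IsLetter(r):
		return "WORD"
	case unicode.IsPunct(r) || unicode.IsSymbol(r):
		return "PUNCTUATION"
	}
	return "UNKNOWN"
}

func main() {
	for _, r := range "a ,+" {
		fmt.Printf("%q -> %s\n", r, classify(r))
	}
}
```

A lexData built on this would call classify(l.peek()) instead of MatchString, and each worker would loop with next()/backup() until the classification changes.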

Additional Resources