
I’ve decided to create a lexer (a.k.a. scanner) for an upcoming hobby project. Before writing it, I wanted to see how a lexer can be implemented in Go, and, just as importantly, how to implement one idiomatically. During my research I found that most of the lexers out there in Go land come in one of two forms (in terms of API usage).

I’ve also checked many other open-source packages (thanks, godoc!) and they are all influenced by the approaches I’ve described above. Even though the first approach, from Rob Pike, is cool, he has since decided to rewrite one of his own projects in a more traditional style:

That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast. I think it’s a fine approach to a lexer if you don’t care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I’m probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don’t care about that for your application, I’m sure. So: It’s pretty and nice to work on, but you’d probably not choose that approach for a production compiler.

I’m not going to dive into the code and explain how exactly these approaches work since both the video from Rob Pike and the blog posts are well written. Instead I’ll be focusing on the API and the usage.

Let’s start by looking at the traditional scanning styles available in the standard library: text/scanner and go/scanner. The following usages reflect the APIs; how the lexer works under the hood is not important for now.

Tokens, their positions and literal texts

Before we dive into the API definitions of these packages, let me introduce how a Token can be defined. I really like how go/token handles it. Here is a simplified version:

// Token is the set of lexical tokens of the Go programming language.
type Token int

// The list of tokens.
const (
	// Special tokens
	ILLEGAL Token = iota
	EOF

	literal_beg
	IDENT  // main
	INT    // 12345
	STRING // "abc"
	literal_end

	keyword_beg
	BREAK
	FUNC
	SELECT
	keyword_end
)

Tokens are just enumerations based on what you think should be tokenized. For example, you could tokenize the literals true and false as a token.BOOL token, or you could decide that they are just identifiers. Please check out the full source code. What is interesting here are the constants literal_beg/literal_end and keyword_beg/keyword_end. They allow us to easily check whether a given token belongs to a set of tokens:

// IsLiteral returns true for tokens corresponding to identifiers
// and basic type literals; it returns false otherwise.
func (tok Token) IsLiteral() bool {
	return literal_beg < tok && tok < literal_end
}

// IsKeyword returns true for tokens corresponding to keywords;
// it returns false otherwise.
func (tok Token) IsKeyword() bool {
	return keyword_beg < tok && tok < keyword_end
}
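
To see these predicates in action, here is a small runnable snippet that uses the real go/token package (which ships with the same methods):

package main

import (
	"fmt"
	"go/token"
)

func main() {
	fmt.Println(token.FUNC.IsKeyword())   // true
	fmt.Println(token.FUNC.IsLiteral())   // false
	fmt.Println(token.STRING.IsLiteral()) // true
}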

So this is how you define your tokens. The meaning is up to you and the language/syntax you are trying to parse. However, a token alone is not that useful: you also need the token’s position and its literal string.

The position is a type that holds the line, the column, and usually also the byte offset. Given a file, this information lets you jump directly to the token.
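
go/token, for instance, defines its Position along exactly these lines:

// Position describes an arbitrary source position
// including the file, line, and column location.
type Position struct {
	Filename string // filename, if any
	Offset   int    // offset, starting at 0
	Line     int    // line number, starting at 1
	Column   int    // column number, starting at 1 (byte count)
}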

The literal string needs some explanation. Certain tokens also carry values that need to be extracted. Keywords, for example, don’t have values; they are predefined names (such as token.BREAK, token.SELECT, etc.). For a token.STRING, however, there is a value, and it needs to be captured. This is the job of the evaluator part of the lexer: the values of such tokens should be retrieved and provided by the lexer. If we have a token.INT, the value might be “0x123”, “1234” or “45e+10”. Of course, this all depends on what you define an int to be.
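
To make this concrete, here is a minimal sketch of how such an extraction might look. Note that the Lexer type, its fields, and isDigit are illustrative assumptions of mine, not part of any package discussed here:

// Lexer is a hypothetical scanner over a byte slice.
type Lexer struct {
	src []byte // input being scanned
	pos int    // current offset into src
}

func isDigit(b byte) bool { return '0' <= b && b <= '9' }

// scanNumber consumes a run of digits and returns the INT token
// together with its literal text. Evaluating the string "1234" into
// the number 1234 is left to a later stage (the parser or evaluator).
func (l *Lexer) scanNumber() (Token, string) {
	start := l.pos
	for l.pos < len(l.src) && isDigit(l.src[l.pos]) {
		l.pos++
	}
	return INT, string(l.src[start:l.pos])
}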

As you can see, the lexer doesn’t try to assign meaning. The meaning is defined by what you want certain character sequences to be recognized as.

Current scanner APIs

Let’s move on and observe how text/scanner and go/scanner each provide an API that gives us both the position and the literal text.

Let’s start by looking at go/scanner. This package is used by the official Go tools themselves, so it’s battle-tested and works very well. And because it’s intended to be used with Go syntax, the constructor is also a little different. Below is the official example from the docs:

// src is the input that we want to tokenize.
src := []byte("cos(x) + 1i*sin(x) // Euler")

// Initialize the scanner.
var s scanner.Scanner
fset := token.NewFileSet()
file := fset.AddFile("", fset.Base(), len(src))
s.Init(file, src, nil, scanner.ScanComments)

// Traverse the input source.
for {
	pos, tok, lit := s.Scan()
	if tok == token.EOF {
		break
	}
	fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
}
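
Running this prints one token per line. If I recall the package documentation correctly, the output looks like this (position, token, then the quoted literal; note the automatically inserted semicolon at the end):

1:1	IDENT	"cos"
1:4	(	""
1:5	IDENT	"x"
1:6	)	""
1:8	+	""
1:10	IMAG	"1i"
1:12	*	""
1:13	IDENT	"sin"
1:16	(	""
1:17	IDENT	"x"
1:18	)	""
1:20	;	"\n"
1:20	COMMENT	"// Euler"

Notice that only literal tokens (identifiers, numbers, the comment) carry a non-empty literal string, exactly as discussed above.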

API of go/scanner:

type Scanner

func (s *Scanner) Init(file *token.File, src []byte, err ErrorHandler, mode Mode)

func (s *Scanner) Scan() (token.Pos, token.Token, string)

Notes about the API: