Since I first started programming I have always been interested in understanding how a programming language works. A lot happens in the background, and I feel that understanding the layer underneath the language you program in can be very beneficial to understanding the language as a whole.

First I decided to write a compiler that would convert my own defined language into C++. This ended up working quite well but felt somewhat like cheating, as the heavy lifting was still done by the GNU compilers. I then started learning Scala, which got me interested in writing a language that compiles to Java bytecode.

My chosen language for writing the compiler is Scala, though the project could easily be ported to other languages.

The aim of this blog is to create a compiler that converts our code into Java bytecode. The language will have a syntax similar to Java/Scala but use indentation instead of curly braces.

Why convert into JVM bytecode?

Full access to all existing Java libraries

Cross-platform

JVM deals with memory handling

Project Structure

Seeing as this project will be rather large, I've decided to split it into sections.

Lexer (Tokenization)

The lexer will take a string and convert it into tokens using regular expressions as a set of rules.
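As a quick illustration of the idea (this snippet is just for demonstration and is not part of the compiler), a regular expression anchored with `^` only matches at the start of the remaining input, which is exactly what a lexer rule needs:

```scala
import scala.util.matching.Regex

object AnchorDemo {
  // Anchored patterns only match at the start of the remaining input
  val integerPattern: Regex = "^([0-9]+)".r
  val identifierPattern: Regex = "^([a-zA-Z][a-zA-Z0-9]*)".r

  def main(args: Array[String]): Unit = {
    val input = "15 + x"
    println(integerPattern.findFirstIn(input))    // Some(15): leading digits match
    println(identifierPattern.findFirstIn(input)) // None: input does not start with a letter
  }
}
```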

Parser

A check made using regular expressions decides what each piece of code represents.
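As a rough sketch of that idea (the real parser comes later; the patterns and names here are hypothetical, for illustration only), regular expressions over a line of code can decide which construct it represents:

```scala
object ParserSketch {
  // Hypothetical patterns classifying a whole line of source code
  private val assignment = "^[a-zA-Z][a-zA-Z0-9]*\\s*=\\s*[0-9]+$".r
  private val methodCall = "^[a-zA-Z][a-zA-Z0-9]*\\(.*\\)$".r

  // Scala lets a Regex be used directly as a match pattern
  def classify(line: String): String = line.trim match {
    case assignment() => "assignment"
    case methodCall() => "method call"
    case _            => "unknown"
  }

  def main(args: Array[String]): Unit = {
    println(ParserSketch.classify("x = 15"))     // assignment
    println(ParserSketch.classify("println(x)")) // method call
  }
}
```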

Block Creation (Abstract Syntax Tree)

Blocks are used to represent the code in memory.
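A minimal sketch of what such a block might look like (the names here are hypothetical; the real block classes will appear in a later post):

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch: each block knows its children, forming a tree
abstract class Block {
  val children: ListBuffer[Block] = ListBuffer[Block]()
  def addBlock(b: Block): Unit = children += b
}

class ClassBlock(val name: String) extends Block
class MethodBlock(val name: String) extends Block

object BlockDemo {
  def main(args: Array[String]): Unit = {
    // Build a tiny tree: a class containing one method
    val cls = new ClassBlock("MyClass")
    cls.addBlock(new MethodBlock("main"))
    println(cls.children.size) // 1
  }
}
```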

Generate Java ASM code

The blocks are looped through and the Java bytecode is output to a file using the ASM library.

Lexer (Tokenization)

The main focus today is on tokenization. Tokenization splits a string into multiple tokens, assigning each a type and a value.

```scala
// Token.scala
import TokenType.TokenType

class Token(val tokenInit: String, val typeInit: TokenType) {
  def token: String = tokenInit
  def tokenType: TokenType = typeInit
  override def toString: String = tokenInit
}
```

A token type is defined using an enum, with each value linked to a regular expression. Every token will have its type set.

```scala
// TokenType.scala
object TokenType extends Enumeration {
  type TokenType = Value
  val EMPTY,            // "^( )"
      TOKEN,            // "^(=().,')"
      IDENTIFIER,       // "^([a-zA-Z][a-zA-Z0-9]*)"
      INTEGER_LITERAL,  // "^([0-9]+)"
      COLON = Value     // "^([:])"
}
```

The TokenData class stores the regular expression and TokenType.

```scala
// TokenData.scala
import scala.util.matching.Regex
import TokenType.TokenType

class TokenData(val pattern: Regex, val `type`: TokenType) {
  def getPattern: Regex = pattern
  def getType: TokenType = `type`
}
```

The tokenizer takes the string and loops through all the TokenData rules to see if one matches the start of the remaining input. If one does, that match becomes the next token.

```scala
// Tokenizer.scala
import scala.collection.mutable.ListBuffer

class Tokenizer(var str: String) {

  private val tokenDatas: ListBuffer[TokenData] = ListBuffer[TokenData](
    new TokenData("^([0-9]+)".r, TokenType.INTEGER_LITERAL),
    new TokenData("^([a-zA-Z][a-zA-Z0-9]*)".r, TokenType.IDENTIFIER),
    new TokenData("^([:])".r, TokenType.COLON)
  )

  // Single-character symbols all share the generic TOKEN type
  for (t <- List[String]("\\=", "\\(", "\\)", "\\.", "\\,", "\\'"))
    tokenDatas += new TokenData(("^(" + t + ")").r, TokenType.TOKEN)

  def nextToken: Token = {
    str = str.trim
    if (str.isEmpty) {
      new Token("", TokenType.EMPTY)
    } else {
      for (data <- tokenDatas) {
        val matched = data.pattern.findFirstIn(str)
        if (matched.isDefined) {
          val token: String = matched.getOrElse("")
          // Strip only the matched prefix; str.replace(token, "") would
          // incorrectly remove every occurrence of the token in the string
          str = str.substring(token.length)
          return new Token(token, data.getType)
        }
      }
      throw new IllegalStateException("Could not parse: " + str)
    }
  }

  def hasNextToken: Boolean = !str.isEmpty
}
```

Using this tokenizer we can now step through sections of code to extract data.

Here is some test code to show it in action.

```scala
// TokenizerTest.scala
object TokenizerTest {
  def main(args: Array[String]): Unit = {
    val tokenizer = new Tokenizer("x = 15")

    val varToken = tokenizer.nextToken
    println("Variable Token - Value: " + varToken.token + " Type: " + varToken.tokenType)

    val assignmentToken = tokenizer.nextToken
    println("Assignment Token - Value: " + assignmentToken.token + " Type: " + assignmentToken.tokenType)

    val intToken = tokenizer.nextToken
    println("Integer Token - Value: " + intToken.token + " Type: " + intToken.tokenType)
  }
}
```

Output:

```
Variable Token - Value: x Type: IDENTIFIER
Assignment Token - Value: = Type: TOKEN
Integer Token - Value: 15 Type: INTEGER_LITERAL
```
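To tokenize an entire line you would loop while `hasNextToken` holds. As a self-contained sketch of that loop (it inlines a trimmed-down rule set mirroring the Tokenizer above so it runs on its own; it is not the compiler's driver code):

```scala
import scala.collection.mutable.ListBuffer
import scala.util.matching.Regex

object LoopDemo {
  // Trimmed-down rule set mirroring the Tokenizer above
  private val rules: ListBuffer[(Regex, String)] = ListBuffer(
    ("^([0-9]+)".r, "INTEGER_LITERAL"),
    ("^([a-zA-Z][a-zA-Z0-9]*)".r, "IDENTIFIER"),
    ("^(=)".r, "TOKEN")
  )

  // Consume the input from the front, emitting (value, type) pairs
  def tokenize(input: String): List[(String, String)] = {
    var str = input.trim
    val tokens = ListBuffer[(String, String)]()
    while (str.nonEmpty) {
      val hit = rules.collectFirst {
        case (pattern, tokenType) if pattern.findFirstIn(str).isDefined =>
          (pattern.findFirstIn(str).get, tokenType)
      }
      hit match {
        case Some((token, tokenType)) =>
          tokens += ((token, tokenType))
          str = str.substring(token.length).trim
        case None =>
          throw new IllegalStateException("Could not parse: " + str)
      }
    }
    tokens.toList
  }

  def main(args: Array[String]): Unit =
    LoopDemo.tokenize("x = 15").foreach(println)
}
```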

Source Files

The complete files are available in the tokenizer directory within my project:

https://github.com/Michael2109/cobalt