Lexing

Compilation of source code usually consists of three steps: lexing, parsing, and generating executables. Lexing is the conversion of text (a stream of characters) into a stream of tokens. One token can be represented by multiple characters. For example, the source code

```javascript
function a() {
    var a = 1;
}
```

could be transformed into a stream of tokens similar to this:

KFunction, Identifier("a"), LBracket, KVar, Identifier("a"), Assignment, Number(1), Semicolon, RBracket

The lexing phase is important because it simplifies the syntax grammar used for parsing, and the lexer can also normalize the code, for example by removing comments or replacing escape characters.

Even though writing a lexer is not rocket science, doing this part properly is very important. Lexers may process megabytes of data and are used by many development tools, so it's crucial that they are fast.

State machines in Rust

For the state machine I've used the technique explained in Hoverbear's blog post. A state machine is represented by its current state, and it can move between states using predefined transitions.

We can use this concept when creating lexers. The lexer can start in an Initial state and define a transition on ; to a Semicolon state. Semicolon is an accepting state. If we encounter ; while reading input, we try to move from the current state (which could be the Initial state) to whatever state it transitions to on a semicolon. If we end up in a final state, we produce a token, in this case a Semicolon token. In the example below we see a state machine for recognizing a semicolon: the final state is represented by a diamond, and the Undefined state covers undefined transitions.

Semicolon state machine

One possible implementation of this model in Rust could look like the example below. We define a structure generic over the state S, describe how to move from one state to another by implementing the From trait, and then define a wrapper enum that can store the state machine in any of its states.

```rust
struct Initial;
struct Semicolon;

struct StateMachine<S> {
    state: S,
}

impl From<StateMachine<Initial>> for StateMachine<Semicolon> {
    fn from(_st: StateMachine<Initial>) -> Self {
        StateMachine { state: Semicolon }
    }
}

enum StateMachineWrapper {
    Initial(StateMachine<Initial>),
    Semicolon(StateMachine<Semicolon>),
}

let st = StateMachine { state: Initial };           // Start in the Initial state
let st = StateMachineWrapper::Semicolon(st.into()); // Initial -> Semicolon transition
```

A simplified version of the main lexing loop could look like this. A real-world implementation contains a few additions to improve performance and handle edge cases.

```rust
pub fn parse(input: &[u8]) -> Result<Vec<Token>, Error> {
    let mut st = StateMachineWrapper::Initial(StateMachine::new());
    let mut tokens = Vec::with_capacity(input.len()); // Avoid reallocation; the upper bound is known
    let mut c_src = 0; // Current position in the input
    while c_src < input.len() {
        let mut token_len = 0;
        while !st.is_final() {
            let ch = input[c_src]; // Get one character
            st = st.step(ch);      // Move from state to state using the character
            c_src += 1;            // Advance to the next character
            token_len += 1;
        }
        // The last character read is lookahead that triggered the final state,
        // so it is not part of the token itself.
        let token = &input[c_src - token_len..c_src - 1];
        let token = get_token_from_str(token);
        tokens.push(token?);
        st = reset_to_initial_state(); // Back to the Initial state for the next token
    }
    Ok(tokens)
}
```

JavaScript lexing

In this section I'm going to explain the implementation of my JavaScript lexer, written in Rust.

As the basic idea I use the state machine described in the previous section. The JavaScript state machine is a bit more complex, but still relatively simple. All states are defined in a separate module, and the wrapper implements a transition function which, for a given character group, moves from one state to another.

What's a character group? It turns out that many characters don't have any special syntactic meaning and can be grouped together with other characters: for example, all letters except those that can appear inside a hexadecimal number. Character groups not only simplify the code, but also minimize the amount of data that has to sit in the CPU cache during execution.
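As a sketch of the idea (the variant names here are assumptions, not the lexer's actual definitions), a classifier mapping input bytes to groups could look like this:

```rust
// Bytes are collapsed into a small set of equivalence groups, so the
// transition table needs one column per group instead of one per character.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Equivalence {
    Letter,         // a-z, A-Z (hex digits would need separate handling)
    Digit,          // 0-9
    LineTerminator, // \n, \r
    Semicolon,      // ;
    Other,          // everything else
}

fn classify(b: u8) -> Equivalence {
    match b {
        b'a'..=b'z' | b'A'..=b'Z' => Equivalence::Letter,
        b'0'..=b'9' => Equivalence::Digit,
        b'\n' | b'\r' => Equivalence::LineTerminator,
        b';' => Equivalence::Semicolon,
        _ => Equivalence::Other,
    }
}
```

The transition function then matches on (state, group) pairs rather than on (state, character) pairs.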

Transitions for the JavaScript state machine are total, meaning that every state has an outgoing transition for every character. However, characters that are not allowed in a given state end up in a special state called Hell. This state represents an invalid transition, and as soon as it's reached the lexer stops with an error. Rust makes defining transitions very simple thanks to pattern matching. A transition definition could then look like the example below, which shows the transition rules for single-line comments: consume any character until a line terminator is encountered.

```rust
(StateMachineWrapper::SingleLineComment(s), Equivalence::LineTerminator) => {
    StateMachineWrapper::SingleLineCommentAcc(s.into())
}
(StateMachineWrapper::SingleLineComment(s), _) => {
    StateMachineWrapper::SingleLineComment(s)
}
```

Performance optimization

In this section I will explain the performance-related problems I encountered while writing the JavaScript lexer and how I solved them.

One of the biggest performance gains comes from preallocation. Dynamic structures in Rust allocate with a capacity of 0 by default and grow (roughly doubling) whenever they need more space. This allocation strategy is inefficient for the token vector or for unescaping characters. However, when lexing we always know the maximum size of the result, and preallocating it using Vec::with_capacity or String::with_capacity drastically increased performance by saving a lot of reallocations.
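A minimal illustration of the effect (not code from the lexer): as long as we stay within the requested capacity, pushing into a preallocated vector never reallocates.

```rust
// Returns true if pushing `n` bytes into a vector preallocated with
// `Vec::with_capacity(n)` left its capacity untouched, i.e. no
// reallocation happened during the pushes.
fn pushes_without_reallocation(n: usize) -> bool {
    let mut v: Vec<u8> = Vec::with_capacity(n);
    let capacity_before = v.capacity(); // at least `n`
    for _ in 0..n {
        v.push(b'x');
    }
    v.capacity() == capacity_before // unchanged capacity: no reallocation
}
```

By contrast, a vector created with `Vec::new()` starts at capacity 0 and must reallocate repeatedly as it grows.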

Branch prediction plays an important role when trying to squeeze out every last bit of performance. Modern CPUs are very good at predicting which instruction is going to be executed next, which saves us from wasting many cycles just waiting for data to be fetched from cache. For example, an access to the first level of cache takes around 3 cycles. If our prediction success rate is 90%, the app runs roughly 30% longer than with a 100% prediction rate, and that is only the penalty for accessing the fastest cache. Fetching data from an SSD can cost around 300,000 cycles of doing nothing. You can read more on branch prediction in my recent blog post.

There are multiple ways to minimize misprediction. The safest one is to avoid unpredictable if conditions (around a 50% chance of being taken), since these can degrade prediction quite quickly. This often happens with switch statements: the CPU predicts that each branch is not going to be taken, but eventually one of them must be, which means that every execution of the switch statement causes at least one misprediction. Switch statements can usually be replaced by perfect hash maps. The performance gain is not guaranteed and has to be tested on real data, since computing the hash can itself be costly and the improvement may be smaller than expected. In my case, hash maps improved the prediction rate from 96% to 100% and increased speed by 10%.
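One branch-free alternative, sketched below under assumed names (this is not the lexer's actual code), is a 256-entry lookup table built at compile time: a degenerate perfect hash where the byte itself is the key, so the per-character `match` becomes a single indexed load with no branch to mispredict.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum CharGroup {
    Letter,
    Digit,
    Semicolon,
    Other,
}

// Built once at compile time; the `match` below runs during constant
// evaluation, never at runtime.
const fn build_table() -> [CharGroup; 256] {
    let mut table = [CharGroup::Other; 256];
    let mut b = 0usize;
    while b < 256 {
        table[b] = match b as u8 {
            b'a'..=b'z' | b'A'..=b'Z' => CharGroup::Letter,
            b'0'..=b'9' => CharGroup::Digit,
            b';' => CharGroup::Semicolon,
            _ => CharGroup::Other,
        };
        b += 1;
    }
    table
}

static GROUPS: [CharGroup; 256] = build_table();

fn group_of(byte: u8) -> CharGroup {
    GROUPS[byte as usize] // no branch: a single indexed load
}
```

Whether this beats a plain `match` depends on the data, which is why the gain has to be measured rather than assumed.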

Conclusion

Writing any kind of program with a focus on speed is not easy, and it often requires good knowledge of the problem being solved and of the language used, including its standard library. I hope you enjoyed my summary of writing a JavaScript lexer, and see you next time, when we'll discuss a JavaScript evaluator.