\$\begingroup\$

Context

Some time ago I wrote a very simple SQL interpreter for JavaScript for fun one weekend, and have since started using it in production in a few projects at work. As requirements change for the things I'm working on, I implement new features in the library. For a year or so I've made do with a single, monstrous function that handles lexing, parsing and interpreting all at once, while avoiding regex as much as possible, as I've gotten the impression that regex is to be avoided as a parsing tool.

I'm now learning the hard way that this is not sustainable. I'm attempting to implement a proper lexer/parser before adding any new features.

Question

As someone with no experience writing a real lexer, I started by reading a bunch of really dry tutorials in languages I don't really understand. The one thing they all seemed to have in common is that they check each character of the input string individually, which to me seems tedious and unnecessary. So I decided to try it my own way, which turned out to be much, much simpler and shorter than the implementations I found online.

If my implementation is so much better, then why aren't other people doing something similar? There must be something wrong with this algorithm. What is it?

My Method

1. Have an array of regex patterns, one for each token type. Patterns are ordered by priority; for example, quoted strings come before numbers, because numbers that are quoted are strings, not numbers.

2. Loop through each pattern and produce an array of arrays containing all matches for every token type (in the same order of priority). Each match contains:
   - The literal text that was matched
   - Its position in the input string
   - The type of token it matches (e.g. STRING or NUMBER)

3. Loop through the token matches and string them together whenever one match begins exactly where the previous one ends, trying the patterns in order of priority.

4. When no match is found at the position where the last match ended, check whether the total length of all matched tokens === the length of the input string. If not, throw an error.
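The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual library code: the `TOKEN_PATTERNS` table, the token type names, and the `tokenize` function are all hypothetical, and the patterns cover only a tiny SQL-like subset.

```javascript
// Hypothetical token table, ordered by priority (strings before numbers).
// Every pattern needs the "g" flag so that matchAll() can be used.
const TOKEN_PATTERNS = [
  { type: "STRING", regex: /'(?:[^']|'')*'/g },
  { type: "NUMBER", regex: /\d+(?:\.\d+)?/g },
  { type: "WORD", regex: /[A-Za-z_][A-Za-z0-9_]*/g },
  { type: "SYMBOL", regex: /[(),*=<>;]/g },
  { type: "WHITESPACE", regex: /\s+/g },
];

function tokenize(input) {
  // Collect all matches for every pattern: an array of arrays,
  // in the same priority order as TOKEN_PATTERNS.
  const matchesByPattern = TOKEN_PATTERNS.map(({ type, regex }) =>
    Array.from(input.matchAll(regex), (m) => ({
      type,
      text: m[0],
      index: m.index,
    }))
  );

  // Chain matches together: the next token must start exactly
  // where the previous one ended; the highest-priority pattern wins.
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let next;
    for (const matches of matchesByPattern) {
      next = matches.find((m) => m.index === pos);
      if (next) break;
    }
    if (!next) break; // nothing starts at this position
    tokens.push(next);
    pos += next.text.length;
  }

  // If the chained tokens do not cover the whole input, fail.
  if (pos !== input.length) {
    throw new SyntaxError(`Unexpected input at position ${pos}`);
  }
  return tokens;
}

console.log(tokenize("id = '42'").map((t) => t.type));
// → [ 'WORD', 'WHITESPACE', 'SYMBOL', 'WHITESPACE', 'STRING' ]
```

In this sketch the quoted `'42'` comes out as a single STRING token because the STRING pattern is tried first; the NUMBER match for `42` inside the quotes is collected but never chained, since no token ends where it begins.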

The Code

This demo will tokenize a string and show you the details for each token when you mouse over the word or symbol.