Part 1: Tokenizer

In this series, we will develop a new scripting language and describe that process step by step.

The first question that spontaneously comes to the mind of any wondering reader is likely to be: “Do we really need a new programming language?”

Do We Really Need a New Programming Language?

To answer this question, I will allow myself a small digression.

Imagine you are an architect (an actual brick-and-mortar architect, not a software one), and you are unlucky enough to be born in a very bureaucratic country. You have a great idea for a shopping mall in your underdeveloped hometown, so you create the project and apply for a building permit. Of course, they immediately reject you on the grounds that you don’t have a registered company. So, you register a company. In order to do that, you need to deposit some money. Then, you come up with a name for your company, which is rejected. On your fifth try, it’s accepted, and your application goes to the bottom of the heap. Ultimately, you either give up or realize that someone else built a shopping mall in the meantime.

But we are not genuine architects. We are software engineers, and we have the privilege of bringing our ideas to life with no permits or bureaucracy. The only thing we need is spare time and the will to spend it on programming instead of sudoku puzzles.

If you still insist that the motivation for programming cannot be pure curiosity, let me just mention that the first programming language that I designed was developed as a result of necessity, not mere curiosity. However, that shouldn’t be the first motivation for reading this blog. I think that many ideas that you will encounter here are fairly interesting and innovative to keep you interested even if you don’t actually need to create your own programming language.

Our goal of developing a small-footprint scripting language inspired me to name it “Stork”; and luckily, we don’t need to convince any bureaucrat to approve the name.

I am going to develop the programming language as we go through the series, so there is a possibility that I will develop some bugs as well. They should be ironed out as we approach the end of the series.

The complete source code for everything described here is freely available at GitHub.

Finally, to answer the question from the title of this paragraph—no, we don’t actually need a new programming language, but since we are trying to demonstrate how to make a programming language in C++, we will be creating one for demonstration purposes.

Tokenizer’s Little Helpers

I don’t know if other programmers face the same issue on a regular basis, but I face this problem quite frequently:

I need a key-value container that should retrieve values fast, in logarithmic time. However, once I initialize the container, I don’t want to add new values to it. Therefore, std::map<Key, Value> (or std::unordered_map<Key, Value>) is overkill, as it allows fast insertion as well.

I am completely against unnecessary optimization, but in this case, I feel like a lot of memory is wasted on nothing. Not only that, but later, we will need to implement a maximal munch algorithm on top of such a container, and map is not the best container for that.

The second choice is std::vector<std::pair<Key, Value>>, sorted after insertions. The only problem with that approach is reduced code readability, as we need to keep in mind that the vector is sorted, so I developed a small class that enforces that constraint.

(All functions, classes, etc. are declared in the namespace stork. I will omit that namespace for readability.)

```cpp
template <typename Key, typename Value>
class lookup {
public:
    using value_type = std::pair<Key, Value>;
    using container_type = std::vector<value_type>;
private:
    container_type _container;
public:
    using iterator = typename container_type::const_iterator;
    using const_iterator = iterator;

    lookup(std::initializer_list<value_type> init) :
        _container(init)
    {
        std::sort(_container.begin(), _container.end());
    }

    lookup(container_type container) :
        _container(std::move(container))
    {
        std::sort(_container.begin(), _container.end());
    }

    const_iterator begin() const {
        return _container.begin();
    }

    const_iterator end() const {
        return _container.end();
    }

    template <typename K>
    const_iterator find(const K& key) const {
        const_iterator it = std::lower_bound(
            begin(),
            end(),
            key,
            [](const value_type& p, const K& key) {
                return p.first < key;
            }
        );
        return it != end() && it->first == key ? it : end();
    }

    size_t size() const {
        return _container.size();
    }
};
```

As you can see, the implementation of this class is quite simple. I didn’t want to implement all possible methods, just the ones that we will need. The underlying container is a vector, so it can be initialized with a pre-populated vector or with an initializer_list.
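As a quick usage sketch, here is the lookup class in action. The class is repeated in trimmed form (without the second constructor and size) so the snippet stands alone, and the digits map is just an illustrative example, not part of Stork:

```cpp
#include <algorithm>
#include <initializer_list>
#include <string_view>
#include <utility>
#include <vector>

template <typename Key, typename Value>
class lookup {
public:
    using value_type = std::pair<Key, Value>;
    using container_type = std::vector<value_type>;
private:
    container_type _container;
public:
    using const_iterator = typename container_type::const_iterator;

    // sort once, on construction; the container is never modified afterwards
    lookup(std::initializer_list<value_type> init) :
        _container(init)
    {
        std::sort(_container.begin(), _container.end());
    }

    const_iterator begin() const { return _container.begin(); }
    const_iterator end() const { return _container.end(); }

    // binary search in logarithmic time
    template <typename K>
    const_iterator find(const K& key) const {
        const_iterator it = std::lower_bound(
            begin(), end(), key,
            [](const value_type& p, const K& k) { return p.first < k; }
        );
        return it != end() && it->first == key ? it : end();
    }
};

// example: a small, immutable map from names to values
const lookup<std::string_view, int> digits {
    {"one", 1}, {"two", 2}, {"three", 3}
};
```

After construction, find behaves like map::find, returning end() when the key is absent, but the data sits in one contiguous, sorted vector.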

The tokenizer will read characters from the input stream. At this stage of the project, it is hard for me to decide what the input stream will be, so I will use std::function instead.

```cpp
using get_character = std::function<int()>;
```

I will use the well-known convention from C-style stream functions, such as getc, which return an int instead of a char and use a negative number to signal the end of a file.
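For example, any std::istream already follows this convention, so adapting one to get_character is a one-liner. The from_stream helper below is a hypothetical convenience, not part of the article’s code (the alias is repeated so the snippet stands alone):

```cpp
#include <functional>
#include <istream>
#include <sstream>

using get_character = std::function<int()>;

// Hypothetical helper: adapt any std::istream to the get_character
// convention. istream::get() already returns an int and yields a
// negative value (EOF) at the end of input.
get_character from_stream(std::istream& is) {
    return [&is]() {
        return is.get();
    };
}
```

The same pattern works for reading from a string, a file, or a network buffer, which is exactly why std::function is a good placeholder at this stage.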

However, it is really convenient to read a couple of characters in advance, before making an assumption about the token type. To that end, I implemented a stream that allows us to unread some characters.

```cpp
class push_back_stream {
private:
    const get_character& _input;
    std::stack<int> _stack;
    size_t _line_number;
    size_t _char_index;
public:
    push_back_stream(const get_character& input);

    int operator()();

    void push_back(int c);

    size_t line_number() const;
    size_t char_index() const;
};
```

To save space, I will omit the implementation details, which you can find on my GitHub page.

As you can see, push_back_stream is initialized from the get_character function. The overloaded operator() is used to retrieve the next character, and push_back is used to return the character back to the stream. line_number and char_index are convenience methods used for error reports.

Keep in mind that char_index is not the index of the character in the current line but overall; otherwise, we would have to keep all past characters in some container to implement the push_back function correctly.

Reserved Tokens

The tokenizer is the lowest-level compiler component. It has to read the input and “spit out” tokens. There are four types of tokens that are of interest to us:

- Reserved tokens
- Identifiers
- Numbers
- Strings

We are not interested in comments, so the tokenizer will just “eat” them, without notifying anyone.

To ensure appeal and planetary popularity of this language, we will use well-known C-like syntax. It worked quite well for C, C++, JavaScript, Java, C#, and Objective-C, so it must work for Stork as well. In case you need a refresher course, you can consult one of our previous articles covering C/C++ learning resources.

Here is the reserved tokens enumeration:

```cpp
enum struct reserved_token {
    inc, dec,
    add, sub, concat, mul, div, idiv, mod,
    bitwise_not, bitwise_and, bitwise_or, bitwise_xor, shiftl, shiftr,
    assign,
    add_assign, sub_assign, concat_assign, mul_assign, div_assign,
    idiv_assign, mod_assign,
    and_assign, or_assign, xor_assign, shiftl_assign, shiftr_assign,
    logical_not, logical_and, logical_or,
    eq, ne, lt, gt, le, ge,
    question, colon,
    comma,
    semicolon,
    open_round, close_round,
    open_curly, close_curly,
    open_square, close_square,
    kw_if, kw_else, kw_elif,
    kw_switch, kw_case, kw_default,
    kw_for, kw_while, kw_do,
    kw_break, kw_continue, kw_return,
    kw_var, kw_fun,
    kw_void, kw_number, kw_string,
};
```

Enumeration members prefixed with “kw_” are keywords. I had to prefix them as they are usually the same as C++ keywords. The ones without a prefix are operators.

Almost all of them follow the C convention. The ones that don’t are:

- concat and concat_assign (.. and ..=), which will be used for concatenation
- idiv and idiv_assign (\ and \=), which will be used for integer division
- kw_var for variable declaration
- kw_fun for function declaration
- kw_number for number variables
- kw_string for string variables



We will add additional keywords, as needed.

There is one new keyword that merits describing: kw_elif . I am a firm believer that single-statement blocks (without curly braces) are not worth it. I don’t use them (and I don’t feel that anything is missing), except on two occasions:

- When I accidentally type a semicolon immediately after a for, while, or if statement, before the block. If I am lucky, it results in a compile-time error, but sometimes it results in a dummy if-statement and a block that always executes. Fortunately, over the years, I have learned from my mistakes, so it happens very rarely. Pavlov’s dog also learned, eventually.
- When I have “chained” if-statements: an if-block, then one or more else-if-blocks, and optionally an else-block. Technically, when I write else if, that’s an else block with only one statement, which is that if-statement.

Therefore, elif can be used to eliminate braceless statements completely. Whether or not we allow it is a decision that can wait for now.

There are two helper functions that return reserved tokens:

```cpp
std::optional<reserved_token> get_keyword(std::string_view word);

std::optional<reserved_token> get_operator(push_back_stream& stream);
```

The function get_keyword returns an optional keyword from the word passed. That “word” is a sequence of letters, digits, and underscores, starting with a letter or an underscore. It will return a reserved_token if the word is a keyword. Otherwise, the tokenizer will assume that it is an identifier.

The function get_operator tries to read as many characters as possible, as long as the sequence is a valid operator. If it reads too many, it will unread all the extra characters it has read after the longest recognized operator.

For an effective implementation of those two functions, we need two lookups between string_view and reserved_token.

```cpp
const lookup<std::string_view, reserved_token> operator_token_map {
    {"++", reserved_token::inc},
    {"--", reserved_token::dec},
    {"+", reserved_token::add},
    {"-", reserved_token::sub},
    {"..", reserved_token::concat},
    /*more exciting operators*/
};

const lookup<std::string_view, reserved_token> keyword_token_map {
    {"if", reserved_token::kw_if},
    {"else", reserved_token::kw_else},
    {"elif", reserved_token::kw_elif},
    {"switch", reserved_token::kw_switch},
    {"case", reserved_token::kw_case},
    {"default", reserved_token::kw_default},
    {"for", reserved_token::kw_for},
    {"while", reserved_token::kw_while},
    {"do", reserved_token::kw_do},
    {"break", reserved_token::kw_break},
    {"continue", reserved_token::kw_continue},
    {"return", reserved_token::kw_return},
    {"var", reserved_token::kw_var},
    {"fun", reserved_token::kw_fun},
    {"void", reserved_token::kw_void},
    {"number", reserved_token::kw_number},
    {"string", reserved_token::kw_string}
};
```
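To make the get_keyword side concrete, here is a self-contained sketch. The lookup class is stood in for by a plain sorted vector searched with std::lower_bound (which is exactly what lookup::find does under the hood), and the keyword set is trimmed down for brevity:

```cpp
#include <algorithm>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

enum struct reserved_token { kw_if, kw_else, kw_var, kw_fun };

// stand-in for keyword_token_map: a vector sorted once at startup
static const std::vector<std::pair<std::string_view, reserved_token>>
keyword_token_map = [] {
    std::vector<std::pair<std::string_view, reserved_token>> v{
        {"if", reserved_token::kw_if},
        {"else", reserved_token::kw_else},
        {"var", reserved_token::kw_var},
        {"fun", reserved_token::kw_fun},
    };
    std::sort(v.begin(), v.end());
    return v;
}();

// return the keyword token for a word, or nullopt if it is an identifier
std::optional<reserved_token> get_keyword(std::string_view word) {
    auto it = std::lower_bound(
        keyword_token_map.begin(), keyword_token_map.end(), word,
        [](const auto& p, std::string_view w) { return p.first < w; }
    );
    if (it != keyword_token_map.end() && it->first == word) {
        return it->second;
    }
    return std::nullopt;
}
```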

The get_keyword implementation is completely straightforward, but for get_operator, we need a custom comparator that compares a given character with candidate operators, taking only the n-th character into account.

```cpp
class maximal_munch_comparator {
private:
    size_t _idx;
public:
    maximal_munch_comparator(size_t idx) :
        _idx(idx)
    {
    }

    bool operator()(char l, char r) const {
        return l < r;
    }

    bool operator()(
        std::pair<std::string_view, reserved_token> l,
        char r
    ) const {
        return l.first.size() <= _idx || l.first[_idx] < r;
    }

    bool operator()(
        char l,
        std::pair<std::string_view, reserved_token> r
    ) const {
        return r.first.size() > _idx && l < r.first[_idx];
    }

    bool operator()(
        std::pair<std::string_view, reserved_token> l,
        std::pair<std::string_view, reserved_token> r
    ) const {
        // a string of exactly _idx characters counts as "null at _idx",
        // so it compares less than any string with a real character there
        return r.first.size() > _idx && (
            l.first.size() <= _idx ||
            l.first[_idx] < r.first[_idx]
        );
    }
};
```

That’s an ordinary lexicographical comparator that takes into account just the character at position idx, but if the string is shorter, it treats it as if it had a null character at position idx, which is less than any other character.

This is the implementation of get_operator, which should make the maximal_munch_comparator class clearer:

```cpp
std::optional<reserved_token> get_operator(push_back_stream& stream) {
    auto candidates = std::make_pair(
        operator_token_map.begin(),
        operator_token_map.end()
    );

    std::optional<reserved_token> ret;
    size_t match_size = 0;

    std::stack<int> chars;

    for (size_t idx = 0; candidates.first != candidates.second; ++idx) {
        chars.push(stream());

        candidates = std::equal_range(
            candidates.first,
            candidates.second,
            char(chars.top()),
            maximal_munch_comparator(idx)
        );

        if (
            candidates.first != candidates.second &&
            candidates.first->first.size() == idx + 1
        ) {
            match_size = idx + 1;
            ret = candidates.first->second;
        }
    }

    while (chars.size() > match_size) {
        stream.push_back(chars.top());
        chars.pop();
    }

    return ret;
}
```

Basically, we treat all operators as candidates at the beginning. Then, we read character by character and filter current candidates by calling equal_range , comparing only the n-th character. We don’t need to compare the preceding characters as they are already compared, and we don’t want to compare the characters that follow as they are still irrelevant.

Whenever the range is non-empty, we check whether the first element in the range has no more characters (if such an element exists, it is always at the beginning of the range, as the lookup is sorted). In that case, we have matched a legal operator. The longest such operator is the one that we return, and we unread all the characters we read after it.

Tokenizer

Since tokens are heterogeneous, a token is a convenience class that wraps a std::variant of different token types, namely:

- Reserved token
- Identifier
- Number
- String
- End of file

```cpp
class token {
private:
    using token_value = std::variant<reserved_token, identifier, double, std::string, eof>;

    token_value _value;
    size_t _line_number;
    size_t _char_index;
public:
    token(token_value value, size_t line_number, size_t char_index);

    bool is_reserved_token() const;
    bool is_identifier() const;
    bool is_number() const;
    bool is_string() const;
    bool is_eof() const;

    reserved_token get_reserved_token() const;
    std::string_view get_identifier() const;
    double get_number() const;
    std::string_view get_string() const;

    size_t get_line_number() const;
    size_t get_char_index() const;
};
```

identifier is just a class with a single member of the std::string type. It is there for convenience as, in my opinion, std::variant is cleaner if all of its alternatives are different types.
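The wrapper alternatives might look like this. This is a sketch of a possible shape (the exact definitions are in the repository), with reserved_token stubbed so the snippet stands alone:

```cpp
#include <string>
#include <variant>

// stub so the snippet compiles on its own; the real enumeration is above
enum struct reserved_token { kw_if /*...*/ };

// wraps std::string so the variant alternative has its own distinct type
struct identifier {
    std::string name;
};

// empty marker type for the end-of-file alternative
struct eof {
};

using token_value = std::variant<reserved_token, identifier, double, std::string, eof>;
```

With distinct types for every alternative, std::holds_alternative and std::get are unambiguous, which is what makes the is_* and get_* methods of token trivial to implement.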

Now, we can write the tokenizer. It will be one function that will accept push_back_stream and return the next token.

The trick is to use different code branches, based on the character type of the first character we read.

- If we read the end-of-file character, we will return from the function.

- If we read a whitespace character, we will skip it.

- If we read an alphanumeric character (a letter, a digit, or an underscore), we will read all the successive characters of that type (we will also read dots if the first character is a digit). Then, if the first character is a digit, we will try to parse the sequence as a number. Otherwise, we will use the get_keyword function to check whether it is a keyword or an identifier.

- If we read a quotation mark, we will treat it as a string, unescaping escaped characters from it.

- If we read a slash (/), we will check whether the next character is a slash or an asterisk (*), and we will skip the line/block comment in that case.

- Otherwise, we will use the get_operator function.

Here is the tokenize function implementation. I will omit the implementation details of functions that it is calling.

```cpp
token tokenize(push_back_stream& stream) {
    while (true) {
        size_t line_number = stream.line_number();
        size_t char_index = stream.char_index();
        int c = stream();
        switch (get_character_type(c)) {
            case character_type::eof:
                return {eof(), line_number, char_index};
            case character_type::space:
                continue;
            case character_type::alphanum:
                stream.push_back(c);
                return fetch_word(stream);
            case character_type::punct:
                switch (c) {
                    case '"':
                        return fetch_string(stream);
                    case '/':
                    {
                        char c1 = stream();
                        switch (c1) {
                            case '/':
                                skip_line_comment(stream);
                                continue;
                            case '*':
                                skip_block_comment(stream);
                                continue;
                            default:
                                stream.push_back(c1);
                        }
                    }
                    // deliberate fallthrough: a lone '/' is an operator
                    default:
                        stream.push_back(c);
                        return fetch_operator(stream);
                }
                break;
        }
    }
}
```

You can see that it pushes back characters that it reads before it calls a lower-level function. The performance penalty is almost nonexistent, and the lower-level function code is much cleaner.

Exceptions

In one of his rants against exceptions, my brother once said:

“There are two kinds of people: those that throw exceptions and those that have to catch them. I am always in that sad, second group.”

I agree with the spirit of that statement. I don’t particularly like exceptions, and throwing them can make any code much harder to maintain and read. Almost always.

I decided to make an exception (bad pun intended) to that rule. It is really convenient to throw an exception from the compiler to unwind from the depths of compilation.

Here is the exception implementation:

```cpp
class error: public std::exception {
private:
    std::string _message;
    size_t _line_number;
    size_t _char_index;
public:
    error(std::string message, size_t line_number, size_t char_index) noexcept;

    const char* what() const noexcept override;

    size_t line_number() const noexcept;
    size_t char_index() const noexcept;
};
```

However, I promise to catch all exceptions in top-level code. I even added line_number and char_index members for pretty-printing, and the function that does it:

```cpp
void format_error(
    const error& err,
    get_character source,
    std::ostream& output
);
```

Wrapping Up

That concludes the first part of our series. Perhaps it wasn’t too exciting, but we now have a useful tokenizer, along with basic parsing error handling. Both are crucial building blocks for the more interesting stuff that I am going to write about in the coming articles.

I hope that you got some good ideas from this post, and if you want to explore the details, go to my GitHub page.