In the previous post, we started exploring CPython in order to find a way to change the default base of integer literals in Python source code from decimal to hexadecimal. (The last post ended with a short recap; feel free to check it out for a quick refresher.)

Without further ado, we’ll continue right where we stopped last time.

So, we found out that parsetok in Parser\parsetok.c does the tokenizing and the parsing. For this purpose, it receives a pointer to a tok_state struct (i.e. a tokenizer struct) that contains (among other things) the Python source code string. parsetok is a bit long, but we are not intimidated:

```c
/* Parse input coming from the given tokenizer structure.
   Return error code. */
static node *
parsetok(struct tok_state *tok, grammar *g, int start, perrdetail *err_ret,
         int *flags)
{
    parser_state *ps;
    node *n;
    ...
    if ((ps = PyParser_New(g, start)) == NULL) {
        ...
    }
    ...
    for (;;) {
        char *a, *b;
        int type;
        size_t len;
        char *str;
        ...
        type = PyTokenizer_Get(tok, &a, &b);
        if (type == ERRORTOKEN) {
            err_ret->error = tok->done;
            break;
        }
        ...
        len = b - a; /* XXX this may compute NULL - NULL */
        ...
        if (len > 0)
            strncpy(str, a, len);
        str[len] = '\0';
        ...
        if ((err_ret->error =
             PyParser_AddToken(ps, (int)type, str,
                               tok->lineno, col_offset,
                               &(err_ret->expected))) != E_OK) {
            ...
        }
    }
    if (err_ret->error == E_DONE) {
        n = ps->p_tree;
        ps->p_tree = NULL;
        ...
    }
    else
        n = NULL;
    ...
    PyTokenizer_Free(tok);
    return n;
}
```

Cool.

First, PyParser_New is called to create a parser_state struct (i.e. a parser struct), which also contains an empty CST. Then, in a loop, PyTokenizer_Get is called to get the next token’s string and type (in my humble opinion, ‘token_str_start_ptr’ and ‘token_str_end_ptr’ would have been more suitable names than ‘a’ and ‘b’). If the token is valid (type != ERRORTOKEN), PyParser_AddToken is called to add the token to our CST. When there are no more tokens left, tokenizing and parsing are complete. Subsequently, the tokenizer is freed, and the CST is returned.
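The loop above is easiest to get a feel for from Python itself: the stdlib tokenize module exposes a tokenizer (a separate reimplementation, not the C code we are reading) that yields the same kind of (type, string) pairs PyTokenizer_Get hands to the parser. A quick sketch:

```python
import io
import tokenize

# Tokenize a tiny snippet and collect (token type name, token string) pairs,
# roughly the (type, str) pairs that parsetok feeds to PyParser_AddToken.
source = "x = 0x1F\n"
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(tokens)
```

The token stream comes out as the NAME/OP/NUMBER/NEWLINE/ENDMARKER sequence we would expect, with ‘0x1F’ as a single NUMBER token.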

We search for ‘PyTokenizer_Get’, and find it in Parser\tokenizer.c:

```c
int
PyTokenizer_Get(struct tok_state *tok, char **p_start, char **p_end)
{
    int result = tok_get(tok, p_start, p_end);
    ...
    return result;
}
```

Ok, we go straight to tok_get (which is also in Parser\tokenizer.c), and…

Oh my.

tok_get is almost 500 lines of code. This is it. The tokenizing function. This is going to be a hell of a dive…

Well, actually we don’t feel like drowning today, so we’ll split it into a few smaller dives:

```c
/* Get next token, after space stripping etc. */
static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{
    int c;
    ...
    /* Get indentation level */
    if (tok->atbol) {
        ...
        tok->atbol = 0;
        for (;;) {
            c = tok_nextc(tok);
            if (c == ' ')
                ...
            else if (c == '\t') {
                ...
            }
            else if (c == '\014') /* Control-L (formfeed) */
                ...
            else
                break;
        }
        tok_backup(tok, c);
        ...
    }

    tok->start = tok->cur;

    /* Return pending indents/dedents */
    if (tok->pendin != 0) {
        if (tok->pendin < 0) {
            tok->pendin++;
            return DEDENT;
        }
        else {
            tok->pendin--;
            return INDENT;
        }
    }
    ...
    /* Skip spaces */
    do {
        c = tok_nextc(tok);
    } while (c == ' ' || c == '\t' || c == '\014');

    /* Set start of current token */
    tok->start = tok->cur - 1;

    /* Skip comment */
    if (c == '#')
        while (c != EOF && c != '\n')
            c = tok_nextc(tok);

    /* Check for EOF and errors now */
    if (c == EOF) {
        return tok->done == E_EOF ? ENDMARKER : ERRORTOKEN;
    }
```

First, if the tokenizer’s atbol (short for ‘at beginning of line’) flag is set, spaces and tabs are counted. This is done by calling tok_nextc repeatedly to get the next char from the tokenizer until a char other than a space or a tab is encountered, and then calling tok_backup to restore the extra char that tok_nextc consumed.

If any erroneous indentation is spotted, ERRORTOKEN is returned (I have removed those checks).

Otherwise, if the indentation of this line is bigger or smaller than the last one, either INDENT or DEDENT is returned respectively.
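We can watch these INDENT and DEDENT tokens being emitted with the stdlib tokenize module (again, a separate reimplementation, but it follows the same rules):

```python
import io
import tokenize

# A block that indents one level after 'if x:' and dedents back before 'z'.
source = "if x:\n    y = 1\nz = 2\n"
types = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(types)
```

The INDENT shows up right after the first NEWLINE, and the matching DEDENT right before the ‘z’ line, exactly as the pendin bookkeeping above suggests.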

After that, tok_nextc is again called repeatedly in order to skip spaces and tabs. There are several states in which we might or might not reach this space-skipping code:

- This token is at the beginning of a line:
  - This line’s indentation is invalid, and so ERRORTOKEN is returned before we reach here.
  - This line’s indentation is valid but different from the previous line’s, and so either INDENT or DEDENT is returned before we reach here.
  - This line’s indentation is the same as the previous line’s, so we reach here after consuming all indentation spaces, and there aren’t any more spaces to skip.
- This token is in the middle or at the end of a line.

Indeed, if we encounter any spaces here, we must be in the middle or at the end of a line, where spaces are meaningless, and thus they are just skipped.

Later, everything from a ‘#’ char until a new line or until the end of the file is skipped, as it is simply a comment.

Finally (for this brief dive), if EOF is reached, tok_get returns.

Just to make sure it does what we think it does, we search for ‘tok_nextc’, and find its definition and tok_backup’s definition next to each other, also in Parser\tokenizer.c. tok_nextc is quite long, but its comment is good enough for us:

```c
/* Get next char, updating state; error code goes into tok->done */
static int
tok_nextc(struct tok_state *tok)
{
    ...
}

/* Back-up one character */
static void
tok_backup(struct tok_state *tok, int c)
{
    if (c != EOF) {
        if (--tok->cur < tok->buf)
            Py_FatalError("tok_backup: beginning of buffer");
        if (*tok->cur != c)
            *tok->cur = c;
    }
}
```

tok_backup is fairly straightforward. tok->cur is decremented, but if it was already pointing to the beginning of the buffer, something is obviously terribly wrong, so a fatal error is raised. Then, in case the previous char is not already the char we wanted to restore, it is overwritten. We have trouble figuring out why that should ever happen, but whatever.
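The nextc/backup pair is a classic one-character-lookahead reader. A toy Python analogue (the names and structure here are mine, nothing in this snippet is CPython code) shows the mechanics, including why backing up past the start of the buffer is fatal:

```python
class CharReader:
    """Toy analogue of tok_nextc/tok_backup (names and structure are mine)."""

    EOF = None

    def __init__(self, buf):
        self.buf = buf
        self.cur = 0  # mirrors tok->cur

    def nextc(self):
        """Return the next char, or EOF when the buffer is exhausted."""
        if self.cur >= len(self.buf):
            return self.EOF
        c = self.buf[self.cur]
        self.cur += 1
        return c

    def backup(self, c):
        """Un-read the char we just got; backing up past the start is fatal."""
        if c is self.EOF:          # mirrors the `c != EOF` guard
            return
        if self.cur == 0:
            raise RuntimeError("backup: beginning of buffer")
        self.cur -= 1

r = CharReader("ab")
assert r.nextc() == "a"
r.backup("a")          # give 'a' back...
assert r.nextc() == "a"  # ...and read it again
```

This lookahead-and-give-back pattern is exactly what tok_get does after every over-read: consume one char too many to find the end of a token, then hand the extra char back.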

Back to tok_get, it seems like we are finally starting to deal with chars that aren’t whitespace:

```c
    ...
    /* Identifier (most frequent token!) */
    ...
    if (is_potential_identifier_start(c)) {
        /* Process b"", r"", u"", br"" and rb"" */
        ...
        while (1) {
            if (!(saw_b || saw_u) && (c == 'b' || c == 'B'))
                ...
            else if (!(saw_b || saw_u || saw_r) && (c == 'u' || c == 'U'))
                ...
            else if (!(saw_r || saw_u) && (c == 'r' || c == 'R'))
                ...
            else
                break;
            c = tok_nextc(tok);
            if (c == '"' || c == '\'')
                goto letter_quote;
        }
        while (is_potential_identifier_char(c)) {
            ...
            c = tok_nextc(tok);
        }
        tok_backup(tok, c);
        ...
        *p_start = tok->start;
        *p_end = tok->cur;
        ...
        return NAME;
    }
```

We take a quick look at is_potential_identifier_start and is_potential_identifier_char, which turn out to be two simple macros (also defined in Parser\tokenizer.c), that do exactly as their names claim.

```c
#define is_potential_identifier_start(c) (\
                (c >= 'a' && c <= 'z')\
                || (c >= 'A' && c <= 'Z')\
                || c == '_'\
                || (c >= 128))

#define is_potential_identifier_char(c) (\
                (c >= 'a' && c <= 'z')\
                || (c >= 'A' && c <= 'Z')\
                || (c >= '0' && c <= '9')\
                || c == '_'\
                || (c >= 128))
```
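The two macros port almost line-for-line to Python. Note the `c >= 128` arm: any non-ASCII char only *potentially* belongs to an identifier at this point; the real Unicode validity check happens later, which is presumably why the macros have ‘potential’ in their names:

```python
def is_potential_identifier_start(c):
    # Same conditions as the C macro: ASCII letter, underscore,
    # or any char outside the ASCII range.
    return ("a" <= c <= "z") or ("A" <= c <= "Z") or c == "_" or ord(c) >= 128

def is_potential_identifier_char(c):
    # Same as above, plus ASCII digits.
    return is_potential_identifier_start(c) or ("0" <= c <= "9")

assert is_potential_identifier_start("_")
assert not is_potential_identifier_start("7")  # a digit can't start a name
assert is_potential_identifier_char("7")       # but may appear inside one
```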

Back to tok_get, if is_potential_identifier_start returns true, a clever while loop checks whether it is actually some combination of a string or bytes literal prefix followed by an apostrophe or a quotation mark. If it is, it could only be a string or bytes literal, so we jump to letter_quote, which would treat the token as a potential string or bytes literal.

Now that we know this token must be an identifier or a keyword, we consume chars until we reach the end of the token. This is done by calling tok_nextc and is_potential_identifier_char repeatedly, until is_potential_identifier_char returns false. Subsequently, tok_backup is called to restore the extra char that was consumed.

At this point, we have the whole token, so we can determine whether this is a valid ‘async’ or ‘await’ keyword (which is done by some checks I have removed). Otherwise, it must be an identifier or another keyword, so NAME is returned.

We wonder why ‘async’ and ‘await’ receive such special treatment, as it seems any other keyword (e.g. ‘if’, ‘else’) would be classified as a NAME token. We could probably find a smart answer in a PEP, but we will leave that for another time.
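We can confirm the ‘keywords are just NAME tokens’ part with the stdlib tokenize module: ‘if’ comes out as a plain NAME, indistinguishable at this stage from an identifier:

```python
import io
import keyword
import tokenize

toks = list(tokenize.generate_tokens(io.StringIO("if x:\n    pass\n").readline))
# Both the keyword 'if' and the identifier 'x' are NAME tokens;
# telling them apart is the parser's job, not the tokenizer's.
print(tokenize.tok_name[toks[0].type], toks[0].string)
print(keyword.iskeyword(toks[0].string))
```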

Anyway, we continue exploring tok_get:

/* Newline */ if (c == '

') { ... return NEWLINE; } /* Period or number starting with period? */ if (c == '.') { ... return DOT; } /* Number */ if (isdigit(c)) { if (c == '0') { /* Hex, octal or binary -- maybe. */ c = tok_nextc(tok); if (c == '.') goto fraction; if (c == 'j' || c == 'J') goto imaginary; if (c == 'x' || c == 'X') { /* Hex */ c = tok_nextc(tok); if (!isxdigit(c)) { tok->done = E_TOKEN; tok_backup(tok, c); return ERRORTOKEN; } do { c = tok_nextc(tok); } while (isxdigit(c)); } else if (c == 'o' || c == 'O') { /* Octal */ c = tok_nextc(tok); if (c < '0' || c >= '8') { tok->done = E_TOKEN; tok_backup(tok, c); return ERRORTOKEN; } do { c = tok_nextc(tok); } while ('0' <= c && c < '8'); } else if (c == 'b' || c == 'B') { /* Binary */ c = tok_nextc(tok); if (c != '0' && c != '1') { tok->done = E_TOKEN; tok_backup(tok, c); return ERRORTOKEN; } do { c = tok_nextc(tok); } while (c == '0' || c == '1'); } else { int nonzero = 0; /* maybe old-style octal; c is first char of it */ /* in any case, allow '0' as a literal */ while (c == '0') c = tok_nextc(tok); while (isdigit(c)) { nonzero = 1; c = tok_nextc(tok); } if (c == '.') goto fraction; else if (c == 'e' || c == 'E') goto exponent; else if (c == 'j' || c == 'J') goto imaginary; else if (nonzero) { tok->done = E_TOKEN; tok_backup(tok, c); return ERRORTOKEN; } } }

Next, if the token is a new line or a period, NEWLINE or DOT is returned, respectively.

And then…

Unbelievable.

We actually got to where tok_get identifies a NUMBER token.

isdigit is called to check whether the first char of the token is a digit. If it is, then the token could only be a number. First things first: if this char is a zero, we check for some special cases of number literals that start with a zero.

We call tok_nextc to get the next char, and check whether it is a dot. If it is, it could only be a fraction, so we jump to the code that handles fraction literals.

Then, we check whether the char following the leading zero is the letter ‘j’. If it is, this is the imaginary number zero, so we jump to the code that handles imaginary number literals (which probably does almost nothing, as the letter ‘j’ must be the last char of an imaginary number literal).

Later, we check whether our number token starts with one of the three base prefixes: hex, octal or binary. If it does, tok_nextc is called again, and the next char of the token is checked. If that char is invalid in the respective number base, tok_backup is called to restore it (even though it is invalid, it is not a part of this token), and ERRORTOKEN is returned. Otherwise, it must be a valid NUMBER token, so tok_nextc is called repeatedly to consume all following digits (in that number base) and reach the end of the token.
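The prefix branch can be sketched in a few lines of Python (a toy scanner of my own, not CPython code): match a base prefix, demand at least one valid digit, then consume valid digits greedily:

```python
def scan_prefixed_int(s):
    """Toy re-implementation of the prefix branch of tok_get (my sketch,
    not CPython's code): return the matched literal, or None for an
    ERRORTOKEN (a base prefix with no valid digit after it)."""
    digits = {"x": "0123456789abcdefABCDEF", "o": "01234567", "b": "01"}
    if len(s) < 2 or s[0] != "0" or s[1].lower() not in digits:
        return None
    allowed = digits[s[1].lower()]
    i = 2
    if i >= len(s) or s[i] not in allowed:
        return None  # prefix with no valid digit -> ERRORTOKEN
    while i < len(s) and s[i] in allowed:
        i += 1       # consume digits greedily, stop at the first invalid char
    return s[:i]

assert scan_prefixed_int("0x123g") == "0x123"  # 'g' starts the next token
assert scan_prefixed_int("0xg") is None        # invalid token
```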

Now we have the required knowledge to understand the following behavior:

```
>>> 0x123g
  File "<stdin>", line 1
    0x123g
         ^
SyntaxError: invalid syntax
>>> 0xg
  File "<stdin>", line 1
    0xg
      ^
SyntaxError: invalid token
```

In the first one, the tokenizer identified the NUMBER token ‘0x123’ and the NAME token ‘g’. Then CPython tried to make sense of the syntax, but failed, and so raised an error saying ‘invalid syntax’.

In the second one, the tokenizer identified a token starting with a hex prefix (‘0x’), and concluded it must be a NUMBER token, but then realized the hex prefix is followed by a char which is not a hex digit. Therefore, it raised an error saying ‘invalid token’.

Back to tok_get.

If the starting zero is not part of a prefix, it is a leading zero in a NUMBER token, which is exactly the same as multiple leading zeros in a NUMBER token, so we might as well call tok_nextc repeatedly to consume all leading zeros.

After that, tok_nextc and isdigit are called repeatedly to consume all decimal digits, until a dot (which means it is a fraction), the letter ‘e’ (which means it is a number with an exponent part) or the letter ‘j’ (which means it is an imaginary number) is encountered. If the decimal digits are followed by any of these three, we jump to the appropriate code.

Wait a moment… We have already checked for a dot and the letter ‘j’ earlier! Looks like the first time was completely redundant. (I have opened an issue about that in CPython’s bug tracker.)

At last, if the token is a non-zero number that starts with leading zeros and is not any of those three special cases, tok_backup is called to restore the extra char that was consumed, and ERRORTOKEN is returned.

This sounds a little weird, so we try it out in our interpreter, and realize that indeed everything works exactly like that:

```
>>> 00000004
  File "<stdin>", line 1
    00000004
           ^
SyntaxError: invalid token
>>> 00000004e3
4000.0
>>> 00000004j
4j
>>> 00000004.
4.0
>>> 00000004.3
4.3
>>> 0000000
0
>>> 0000000.0
0.0
```

Maybe the ‘maybe old-style octal’ comment is related to that odd behavior. We google ‘python PEP octal’, and the first result is PEP 3127, which explains that in the ancient Python 2 (the wording is mine, of course), leading zeros in a number literal were the same as adding the ‘0o’ octal prefix. The old and wise core developers had decided this behavior had been confusing, and deprecated it.

It seems a little weird that numbers with an exponent part, fractions and imaginary numbers are still allowed to start with leading zeros, but whatever.
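The whole ‘maybe old-style octal’ branch boils down to one flag. A toy Python sketch (my helper, not CPython code) reproduces the odd behavior we just saw in the interpreter:

```python
def scan_zero_leading(s):
    """Toy sketch of the 'maybe old-style octal' branch (my helper, not
    CPython code): s starts with '0' and has no base prefix."""
    i = 0
    while i < len(s) and s[i] == "0":
        i += 1                 # allow any number of leading zeros
    nonzero = False
    while i < len(s) and s[i].isdigit():
        nonzero = True         # a digit after the zeros: the number is nonzero
        i += 1
    nxt = s[i:i + 1]
    if nxt in (".", "e", "E", "j", "J"):
        return "NUMBER"        # fraction/exponent/imaginary: still legal
    if nonzero:
        return "ERRORTOKEN"    # looks like old-style octal, e.g. 00000004
    return "NUMBER"            # plain zero(s) are fine

assert scan_zero_leading("00000004") == "ERRORTOKEN"
assert scan_zero_leading("00000004e3") == "NUMBER"
assert scan_zero_leading("0000000") == "NUMBER"
```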

All right, so we are done with numbers that start with a zero. Let’s go back to tok_get, and examine the way other numbers are treated:

```c
        else {
            /* Decimal */
            do {
                c = tok_nextc(tok);
            } while (isdigit(c));
            {
                /* Accept floating point numbers. */
                if (c == '.') {
        fraction:
                    /* Fraction */
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'e' || c == 'E') {
                    int e;
                  exponent:
                    e = c;
                    /* Exponent part */
                    c = tok_nextc(tok);
                    if (c == '+' || c == '-') {
                        c = tok_nextc(tok);
                        if (!isdigit(c)) {
                            tok->done = E_TOKEN;
                            tok_backup(tok, c);
                            return ERRORTOKEN;
                        }
                    }
                    else if (!isdigit(c)) {
                        tok_backup(tok, c);
                        tok_backup(tok, e);
                        *p_start = tok->start;
                        *p_end = tok->cur;
                        return NUMBER;
                    }
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'j' || c == 'J')
                    /* Imaginary part */
        imaginary:
                    c = tok_nextc(tok);
            }
        }
        tok_backup(tok, c);
        *p_start = tok->start;
        *p_end = tok->cur;
        return NUMBER;
    }
    ...
}
```

If this else block is reached, the token starts with a decimal digit other than zero, which means it could only be a decimal number. So tok_nextc and isdigit are called repeatedly to consume all following decimal digits.

If the next char is a dot, it must be a fraction, and so tok_nextc and isdigit are again called repeatedly to consume all decimal digits of the fractional part.

Then, if the next char is the letter ‘e’, it might be a NUMBER token with an exponent part. Now, there are some options:

- The letter ‘e’ is followed by a plus or a minus, which means it must be a number with an exponent part:
  - The plus or minus is followed by a decimal digit, i.e. this token is definitely a NUMBER token with a valid exponent part.
  - The plus or minus is followed by a char which is not a decimal digit. This is considered illegal, so that char is restored (as it is not part of the invalid token), and ERRORTOKEN is returned.
- The letter ‘e’ is followed by a char which is neither a sign symbol nor a decimal digit. This means the NUMBER token didn’t have an exponent part after all. Thus, tok_backup is called twice, to restore both that char and the letter ‘e’, and NUMBER is returned.
- The letter ‘e’ is followed by a decimal digit, which means it is indeed a NUMBER token with a valid exponent part.

If we reach the do-while loop after the else-if block, it is already known to be a NUMBER token with a valid exponent part, so tok_nextc and isdigit are called repeatedly to consume all of the decimal digits of the exponent part.
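The three-way exponent decision can be sketched like this (a toy helper of mine, operating on a string and an index instead of the tokenizer struct):

```python
def scan_exponent(s, i):
    """Toy sketch of the exponent branch (my helper, not CPython code).
    s[i] is 'e'/'E'.  Returns the index one past the token for a valid
    exponent, the index of 'e' itself when the 'e' is handed back, or
    None for an ERRORTOKEN."""
    e = i
    i += 1
    if i < len(s) and s[i] in "+-":
        i += 1
        if i >= len(s) or not s[i].isdigit():
            return None   # 'e' + sign but no digit -> ERRORTOKEN
    elif i >= len(s) or not s[i].isdigit():
        return e          # no exponent part after all: give back the 'e'
    while i < len(s) and s[i].isdigit():
        i += 1            # consume all digits of the exponent part
    return i

assert scan_exponent("123e10", 3) == 6            # valid exponent part
assert scan_exponent("123e+x", 3) is None         # '123e+' is an ERRORTOKEN
assert scan_exponent("123expelliarmus", 3) == 3   # the 'e' is handed back
```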

Now we have the necessary knowledge to unravel the difference between the following errors:

```
>>> 123expelliarmus
  File "<stdin>", line 1
    123expelliarmus
       ^
SyntaxError: invalid syntax
>>> 123e+xpelliarmus
  File "<stdin>", line 1
    123e+xpelliarmus
         ^
SyntaxError: invalid token
```

In the first one, the tokenizer determines it is the NUMBER token ‘123’ followed by the NAME token ‘expelliarmus’ (it is only later that CPython realizes this is a syntax error).

In the second one, the tokenizer identifies the potential NUMBER token ‘123e+’, and then determines it is an ERRORTOKEN, because the plus is not followed by a decimal digit.

At last, if the NUMBER token (whatever kind of NUMBER token it is) ends with the letter ‘j’, it is an imaginary number. After confirming that, tok_nextc is called to consume one more char. This is done because all other flows reach the shared return code with an extra char already consumed, so in order to make the call to tok_backup part of the shared code as well, the imaginary number flow must align with the other flows and consume an extra char too.

And then, finally, NUMBER is returned.

Phew.

Tokenizing is not an easy task, and that was only a NUMBER token.

Hmmm… after all that exploration, we realize CPython happily accepts some strange number literals, so to make sure we didn’t get it all wrong, we try them out in the interpreter:

```
>>> 243.j
243j
>>> 123.e2
12300.0
```

Whatever…

Anyway, it looks like we are ready for our first patch.

We want the tokenizer to identify a hex integer literal without any prefix as a NUMBER token. Also, we don’t want to mix hex integer literals with fractions or imaginary numbers (we don’t have to worry about mixing with numbers that have an exponent part, as the letter ‘e’ would be treated as a hex digit anyway).

Therefore, if our patched tokenizer identifies a hex integer literal without a prefix, it shouldn’t accept a dot or the letter ‘j’ as part of the token. However, if it identifies a decimal integer literal without a prefix, it should treat it as a decimal integer literal (i.e. accept a fraction and/or an imaginary number).
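In other words, the patched rule classifies a bare digits-and-letters run like this (a toy Python restatement of the rule, not the patch itself):

```python
import string

def classify_bare_literal(s):
    """Toy restatement of the patched rule (my helper, not the C patch):
    a token is 'hex' once any a-f/A-F digit appears, otherwise 'decimal';
    only the decimal kind may continue into fraction/exponent/imaginary."""
    if not s or not all(c in string.hexdigits for c in s):
        return "invalid"
    if any(c in "abcdefABCDEF" for c in s):
        return "hex"
    return "decimal"

assert classify_bare_literal("2f3") == "hex"      # hex without a prefix
assert classify_bare_literal("243") == "decimal"  # plain decimal
assert classify_bare_literal("2g3") == "invalid"  # 'g' is not a hex digit
```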

Somehow, this turned out to be quite a small patch:

```c
        else {
            /* origComment: Decimal */
            /* orenmnComment: Hex or Decimal */
            int orenmn_is_hex_int_literal = 0;
            do {
                c = tok_nextc(tok);
                if (isxdigit(c) && !isdigit(c))
                    orenmn_is_hex_int_literal = 1;
            // origLine:     } while (isdigit(c));
            } while (isxdigit(c)); // orenmnLine
            // origLine: {
            if (!orenmn_is_hex_int_literal) {
                /* Accept floating point numbers. */
                if (c == '.') {
        fraction:
                    /* Fraction */
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'e' || c == 'E') {
                    int e;
                  exponent:
                    e = c;
                    /* Exponent part */
                    c = tok_nextc(tok);
                    if (c == '+' || c == '-') {
                        c = tok_nextc(tok);
                        if (!isdigit(c)) {
                            tok->done = E_TOKEN;
                            tok_backup(tok, c);
                            return ERRORTOKEN;
                        }
                    }
                    else if (!isdigit(c)) {
                        tok_backup(tok, c);
                        tok_backup(tok, e);
                        *p_start = tok->start;
                        *p_end = tok->cur;
                        return NUMBER;
                    }
                    do {
                        c = tok_nextc(tok);
                    } while (isdigit(c));
                }
                if (c == 'j' || c == 'J')
                    /* Imaginary part */
        imaginary:
                    c = tok_nextc(tok);
            }
        }
```

We build our patched CPython, and get the following behavior:

```
>>> 2f3
ValueError: could not convert string to float: 2f3
>>> 2f3j
  File "<stdin>", line 1
    2f3j
       ^
SyntaxError: invalid syntax
>>> 243j
243j
>>> 2f3.
  File "<stdin>", line 1
    2f3.
       ^
SyntaxError: invalid syntax
>>> 243.
243.0
>>> 2f3.j
ValueError: could not convert string to float: 2f3
>>> 243.j
243j
>>> 2f3.123e2j
  File "<stdin>", line 1
    2f3.123e2j
             ^
SyntaxError: invalid syntax
>>> 243.123e2j
24312.3j
>>> 3e8
300000000.0
>>> 3e8a
ValueError: could not convert string to float: 3e8a
```

Well, at least we have got some of it right (looks like hex integer literals actually don’t mix with fractions and imaginary numbers).

But why did CPython try to convert ‘2f3’ and ‘3e8a’ into floats?

This probably happened because the functions that do the parsing (or those that transform the CST into an AST) received a supposedly valid NUMBER token which is not really that valid. Yet.

And thus, again, we must end this post abruptly, as it too became longer than it had any right to be. As usual, we will continue our journey in the next post.

part 4