In the previous post, we continued exploring CPython in order to find a way to change the default base of integer literals in Python source code from decimal to hexadecimal.

We added a patch to make the tokenizer identify a hex integer literal without any special prefix or suffix as a NUMBER token.

Somewhat as expected, the patch caused faulty behavior when a hex integer literal was entered:

    >>> 2f3
    ValueError: could not convert string to float: 2f3

This is expected because the tokenizer classified a previously invalid token as a NUMBER token, while all other parts of the interpreter are oblivious to the change (in part 2 we went over the steps taken by the CPython interpreter to run a piece of Python source code).

Actually, this error is probably raised by the code that tries to convert a NUMBER token into a Python numeric object (i.e. an instance of any of Python’s numeric types: ‘int’, ‘float’ and ‘complex’).

Let’s find that code.

We search for ‘could not convert string to float’, and find it in Modules\_pickle.c, Objects\floatobject.c and Python\pystrtod.c. The ‘pickle’ module is certainly irrelevant, so we start by examining the one in Objects\floatobject.c:

    PyObject *
    PyFloat_FromString(PyObject *v)
    {
        ...
        if (end != last) {
            PyErr_Format(PyExc_ValueError,
                         "could not convert string to float: "
                         "%R", v);
            ...
        }
        ...
    }

Sounds like this function receives a ‘str’ object. To be sure, we search for ‘PyFloat_FromString’ and find its declaration in Include\floatobject.h:

    /* Return Python float from string PyObject. */
    PyAPI_FUNC(PyObject *) PyFloat_FromString(PyObject*);

As we thought.

Now, it sounds very unlikely for the interpreter to first convert numeric literals in the source into ‘str’ objects, and only then convert them into numeric objects. So PyFloat_FromString is probably not the one that raised that ‘could not convert string to float’ error.

We turn to look at Python\pystrtod.c:

    /* PyOS_string_to_double converts a null-terminated byte string s (interpreted
       as a string of ASCII characters) to a float.  The string should not have
       leading or trailing whitespace.  The conversion is independent of the
       current locale.

       ...
    */

    double
    PyOS_string_to_double(const char *s,
                          char **endptr,
                          PyObject *overflow_exception)
    {
        ...
        else if (!endptr && (fail_pos == s || *fail_pos != '\0'))
            PyErr_Format(PyExc_ValueError,
                         "could not convert string to float: "
                         "%.200s", s);
        else if (fail_pos == s)
            PyErr_Format(PyExc_ValueError,
                         "could not convert string to float: "
                         "%.200s", s);
        ...
    }

Hmmm…

Converting a string of ASCII chars into a float sounds more like something related to the handling of a numeric literal in Python source code.

We search for ‘PyOS_string_to_double’, and find it in 8 different files. However, one file stands out among all others: There is a call to PyOS_string_to_double in Python\ast.c, from the static function parsenumber.

Awesome!

In part 2 we found that PyAST_FromNodeObject in Python\ast.c transforms the CST into an AST. Optimistic as ever, we guess that parsenumber is part of the code that transforms the CST into an AST.

But a guess is not enough, so we examine Python\ast.c more closely:

    ...
    static int validate_stmts(asdl_seq *);
    static int validate_exprs(asdl_seq *, expr_context_ty, int);
    static int validate_nonempty_seq(asdl_seq *, const char *, const char *);
    static int validate_stmt(stmt_ty);
    static int validate_expr(expr_ty, expr_context_ty);
    ...
    int
    PyAST_Validate(mod_ty mod)
    {
        ...
    }
    ...
    static PyObject *parsenumber(struct compiling *, const char *);
    ...
    /* Transform the CST rooted at node * to the appropriate AST */
    mod_ty
    PyAST_FromNodeObject(const node *n, PyCompilerFlags *flags,
                         PyObject *filename, PyArena *arena)
    {
        ...
    }

    mod_ty
    PyAST_FromNode(const node *n, PyCompilerFlags *flags,
                   const char *filename_str, PyArena *arena)
    {
        mod_ty mod;
        PyObject *filename;
        filename = PyUnicode_DecodeFSDefault(filename_str);
        ...
        mod = PyAST_FromNodeObject(n, flags, filename, arena);
        ...
        return mod;
    }
    ...
    static PyObject *
    parsenumber(struct compiling *c, const char *s)
    {
        ...
    }
    ...

It seems like Python\ast.c consists of two quite separate parts (there aren’t any calls from one part to functions in the other part):

- A part that contains functions related to validation checks, whose only non-static function is PyAST_Validate.
- A part that contains functions related to transforming the CST into an AST. The only non-static functions in this part are PyAST_FromNodeObject and PyAST_FromNode, while PyAST_FromNode just decodes some filename, and calls PyAST_FromNodeObject.

The second part is obviously the one we care about, so we would just ignore PyAST_Validate.

This means, in short, that the only way for parsenumber to ever be called, is through PyAST_FromNodeObject.

Very well.

This is our current hypothesis about the way in which the CPython interpreter handles numeric literals:

1. The tokenizer classifies a numeric literal as a NUMBER token.
2. The parser makes a CST node for the NUMBER token, in which it stores the token in some format, while also storing the literal’s source as a utf-8 encoded string.
3. The parser adds the CST node to the appropriate place in the CST.
4. While the CST is converted into an AST, that CST node is parsed and entered into the AST in some form. Among other things, the literal’s source as a utf-8 encoded string is passed to parsenumber, which converts it into a Python numeric object (which is what is actually stored in the AST node).
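The end of this pipeline can be observed from Python itself. The following is a minimal sketch, assuming Python 3.8 or later, where numeric literals show up in the AST as `Constant` nodes with a `value` attribute (in the 3.5 sources examined in this post the node is called `Num`). It only shows the outcome, not the CST stage in between:

```python
import ast

# Parse a tiny expression and look at the AST nodes built for its literals.
tree = ast.parse("42 + 1.5", mode="eval")
left = tree.body.left    # node for the literal 42
right = tree.body.right  # node for the literal 1.5

# Each node already holds a Python numeric object, not the source text.
print(type(left.value), left.value)
print(type(right.value), right.value)
```

Whatever converts the literal's source into a numeric object has therefore already run by the time the AST exists.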

In the previous post, we saw that parsetok (in Parser\parsetok.c) calls PyTokenizer_Get to get the next token, and then calls PyParser_AddToken to add the token to the CST. parsetok passes the following information (which was received from PyTokenizer_Get) to PyParser_AddToken:

- the token’s type
- the token’s source as a utf-8 encoded string
- the token’s location in the source
- the address of some error code variable

In part 2, we saw that (at least in the implementation of the builtin ‘eval’) the CST is constructed and passed to PyAST_FromNodeObject right away.

Thus, to prove our hypothesis, we need only to:

- find how a token’s source as a utf-8 encoded string is stored in a CST node
- verify that the string passed to parsenumber is retrieved from where the token’s source was stored

Let us start by looking at PyParser_AddToken‘s implementation.

We search for ‘PyParser_AddToken’, and find it in Parser\parser.c.

Oh boy.

It seems like PyParser_AddToken is full of compiler stuff, which I don’t understand yet. Luckily, the parameter that holds the source as a utf-8 encoded string is used only 3 times in the function, so we will just follow the references to it:

    int
    PyParser_AddToken(parser_state *ps, int type, char *str,
                      int lineno, int col_offset, int *expected_ret)
    {
        int ilabel;
        ...
        D(printf("Token %s/'%s' ... ", _PyParser_TokenNames[type], str));
        ...
        /* Find out which label this token is */
        ilabel = classify(ps, type, str);
        ...
        /* Loop until the token is shifted or an error occurred */
        for (;;) {
            ...
            /* Check accelerator */
            if (s->s_lower <= ilabel && ilabel < s->s_upper) {
                ...
                if (x != -1) {
                    ...
                    /* Shift the token */
                    if ((err = shift(&ps->p_stack, type, str,
                                     x, lineno, col_offset)) > 0) {
                        ...
                    }
                    ...
                }
            }
            ...
        }
    }

Let’s go over the references one by one.

Ah? A function named ‘D’? Weird.

We look for ‘D’ in Parser\parser.c and find out it is actually a macro:

    #ifdef Py_DEBUG
    extern int Py_DebugFlag;
    #define D(x) if (!Py_DebugFlag); else x
    #else
    #define D(x)
    #endif

In other words, anything inside D would run only if we are in some debug mode.

Back to PyParser_AddToken.

We continue to the second reference to the source string.

Maybe classify stores the source string somewhere…

We search for ‘classify’, and find it in Parser\parser.c:

    static int
    classify(parser_state *ps, int type, const char *str)
    {
        ...
        if (type == NAME) {
            const char *s = str;
            ...
            for (i = n; i > 0; i--, l++) {
                if (l->lb_type != NAME || l->lb_str == NULL ||
                    l->lb_str[0] != s[0] ||
                    strcmp(l->lb_str, s) != 0)
                    continue;
                ...
                ...
                D(printf("It's a keyword\n"));
                return n - i;
            }
        }
        ...
    }

In case classify receives a NAME token, it copies the address of the source string, and compares it to some strings.

In the previous post, we saw that the tokenizer classifies most keywords (all of them, actually, except for ‘async’ and ‘await’) as NAME tokens. Looks like classify is the one that receives (among others) all NAME tokens, and distinguishes between real names and Python keywords.
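The same split between tokenizing and keyword recognition can be seen with the standard library. This is only a sketch: the pure-Python `tokenize` module is a separate implementation from the C tokenizer discussed here, and `keyword.iskeyword` merely plays the role that classify seems to play in the parser.

```python
import io
import keyword
import tokenize

source = "if x: pass\n"

# Collect the text of every token that the tokenizer reports as NAME.
names = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.NAME
]

# Keywords and identifiers arrive with the same token type;
# telling them apart is a separate, later step.
keywords = [s for s in names if keyword.iskeyword(s)]
identifiers = [s for s in names if not keyword.iskeyword(s)]
print(names, keywords, identifiers)
```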

(Hmmm… str is copied into s for no apparent reason. I have opened an issue about that in CPython’s bug tracker.)

Back to PyParser_AddToken.

Inside some infinite loop, it seems the token is shifted by calling shift. We don’t really know what that means, but we just follow references to the source string, so we search for ‘shift’ and find it also in Parser\parser.c:

    static int
    shift(stack *s, int type, char *str, int newstate, int lineno, int col_offset)
    {
        ...
        err = PyNode_AddChild(s->s_top->s_parent, type, str, lineno, col_offset);
        ...
        return 0;
    }

We search for ‘PyNode_AddChild’, and find it in Parser\node.c:

    int
    PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset)
    {
        ...
        node *n;
        ...
        n = &n1->n_child[n1->n_nchildren++];
        n->n_type = type;
        n->n_str = str;
        n->n_lineno = lineno;
        n->n_col_offset = col_offset;
        n->n_nchildren = 0;
        n->n_child = NULL;
        return 0;
    }

The next empty child node of n1 is filled with the token’s attributes, just as they are, no conversions.

Most importantly for our purpose, we now know that the token’s source as a utf-8 encoded string is stored in the n_str field of every CST node.
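To keep the data layout in mind, here is a rough Python model of what PyNode_AddChild appears to do. The `Node` class and `add_child` function are hypothetical illustrations that only mirror the field names seen above; the value used for `NUMBER` is likewise just illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a CST node; fields mirror the C struct."""
    n_type: int = 0
    n_str: Optional[str] = None
    n_lineno: int = 0
    n_col_offset: int = 0
    n_child: List["Node"] = field(default_factory=list)


def add_child(parent, type_, str_, lineno, col_offset):
    # Store the token's attributes as they are, with no conversion.
    child = Node(type_, str_, lineno, col_offset)
    parent.n_child.append(child)
    return child


NUMBER = 2  # illustrative token type value
root = Node()
add_child(root, NUMBER, "2f3", 1, 0)
print(root.n_child[0].n_str)
```

The point of the model is that the literal's text travels through the CST untouched, so any numeric conversion has to happen later.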

Ok then, so to prove our hypothesis, we only have to verify that the string passed to parsenumber is retrieved from the n_str field of a CST node.

We search for ‘parsenumber’ and find (in Python\ast.c) a single call to it:

    static expr_ty
    ast_for_atom(struct compiling *c, const node *n)
    {
        /* atom: '(' [yield_expr|testlist_comp] ')' | '[' [testlist_comp] ']'
           | '{' [dictmaker|testlist_comp] '}' | NAME | NUMBER | STRING+
           | '...' | 'None' | 'True' | 'False'
        */
        node *ch = CHILD(n, 0);
        int bytesmode = 0;

        switch (TYPE(ch)) {
        case NAME: {
            PyObject *name;
            const char *s = STR(ch);
            size_t len = strlen(s);
            if (len >= 4 && len <= 5) {
                if (!strcmp(s, "None"))
                    return NameConstant(Py_None, LINENO(n), n->n_col_offset, c->c_arena);
                if (!strcmp(s, "True"))
                    return NameConstant(Py_True, LINENO(n), n->n_col_offset, c->c_arena);
                if (!strcmp(s, "False"))
                    return NameConstant(Py_False, LINENO(n), n->n_col_offset, c->c_arena);
            }
            name = new_identifier(s, c);
            ...
            /* All names start in Load context, but may later be changed. */
            return Name(name, Load, LINENO(n), n->n_col_offset, c->c_arena);
        }
        case STRING: {
            PyObject *str = parsestrplus(c, n, &bytesmode);
            ...
            if (bytesmode)
                return Bytes(str, LINENO(n), n->n_col_offset, c->c_arena);
            else
                return Str(str, LINENO(n), n->n_col_offset, c->c_arena);
        }
        case NUMBER: {
            PyObject *pynum = parsenumber(c, STR(ch));
            ...
            return Num(pynum, LINENO(n), n->n_col_offset, c->c_arena);
        }
        case ELLIPSIS: /* Ellipsis */
            ...
        case LPAR: /* some parenthesized expressions */
            ...
        case LSQB: /* list (or list comprehension) */
            ...
        case LBRACE: {
            ...
        }
        ...
    }

We cannot resist the temptation of glancing over the irrelevant (for our purpose) parts of ast_for_atom.

We start with the comment at the top of the function.

Hmmm… What is this atom thing, anyway? Maybe it is another word for a literal? We google ‘python atom literal’, and find 6. Expressions — Python 3.5.1 documentation.

So identifiers, literals and enclosures are all atoms. And ast_for_atom probably handles all of these.

Looks like ast_for_atom starts by retrieving a child node of the received node, using CHILD. We search for ‘CHILD’, and find it (along with some more macros we have seen in ast_for_atom) in Include\node.h:

    /* Node access functions */
    #define NCH(n)          ((n)->n_nchildren)

    #define CHILD(n, i)     (&(n)->n_child[i])
    #define RCHILD(n, i)    (CHILD(n, NCH(n) + i))
    #define TYPE(n)         ((n)->n_type)
    #define STR(n)          ((n)->n_str)
    #define LINENO(n)       ((n)->n_lineno)

Great! Macros to access the fields of a CST node!

Back to ast_for_atom.

Indeed, the address of the first child of the received CST node is stored in ch.

After that, there is a switch statement on the type of the first child, which is probably the type given to it by the tokenizer:

- A NAME CST node is checked to determine whether its source string is ‘None’, ‘True’ or ‘False’ (the check is done by using STR, which returns the source string of a CST node):
  - If it is, a NameConstant AST node that contains the right constant is returned.
  - Otherwise, it looks like new_identifier is called to convert the source string into a ‘str’ object. Then, a Name AST node which contains the ‘str’ object is returned.
- A STRING CST node is converted into either a ‘str’ or ‘bytes’ object by parsestrplus. Later, either a Str or Bytes AST node (containing the converted object) is returned.
- A NUMBER CST node is converted into a Python numeric object (we guess) by parsenumber. Thankfully, we note that STR is used to pass the n_str field of the CST node to parsenumber. Subsequently, a Num AST node which contains the converted object is returned.
- ELLIPSIS, LPAR (left parenthesis), LSQB (left square bracket) and LBRACE (left curly brace) CST nodes are also handled, but we will leave those for another time.

Excellent.

Seems our hypothesis proved right.

Which means we only have to understand how parsenumber works, patch it, and we are done!

Finally, we turn to examine parsenumber:

    static PyObject *
    parsenumber(struct compiling *c, const char *s)
    {
        const char *end;
        long x;
        double dx;
        Py_complex compl;
        int imflag;
        ...
        errno = 0;
        end = s + strlen(s) - 1;
        imflag = *end == 'j' || *end == 'J';
        if (s[0] == '0') {
            x = (long) PyOS_strtoul(s, (char **)&end, 0);
            if (x < 0 && errno == 0) {
                return PyLong_FromString(s, (char **)0, 0);
            }
        }
        else
            x = PyOS_strtol(s, (char **)&end, 0);
        if (*end == '\0') {
            if (errno != 0)
                return PyLong_FromString(s, (char **)0, 0);
            return PyLong_FromLong(x);
        }
        /* XXX Huge floats may silently fail */
        if (imflag) {
            ...
            return PyComplex_FromCComplex(compl);
        }
        else
        {
            dx = PyOS_string_to_double(s, NULL, NULL);
            ...
            return PyFloat_FromDouble(dx);
        }
    }

First things first: we should take a quick look at some of the functions called here (i.e. the ones that look relevant for our purpose).

We search for ‘PyLong_FromLong’ and ‘PyLong_FromString’, and find both in Objects\longobject.c:

    /* Create a new int object from a C long int */
    PyObject *
    PyLong_FromLong(long ival)
    {
        ...
    }
    ...
    /* Parses an int from a bytestring. Leading and trailing whitespace will be
     * ignored.
     *
     * If successful, a PyLong object will be returned and 'pend' will be pointing
     * to the first unused byte unless it's NULL.
     *
     * If unsuccessful, NULL will be returned.
     */
    PyObject *
    PyLong_FromString(const char *str, char **pend, int base)
    {
        ...
    }

Quite straightforward… Oh, and this base parameter looks promising.
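The builtin `int` accepts a base argument too, and it gives a feel for what such a parameter does. The sketch below assumes (without having checked the C code path) that `int(string, base)` follows the same convention as the base parameter above: base 0 lets a prefix decide, while an explicit base forces the interpretation.

```python
# Base 0: a prefix such as 0x decides the base, otherwise decimal is assumed.
with_prefix = int("0x2f3", 0)
plain_decimal = int("755", 0)

# Explicit base 16: no prefix is needed for hex digits.
forced_hex = int("2f3", 16)

# Without a prefix and with base 0, hex digits are rejected.
try:
    int("2f3", 0)
    rejected = False
except ValueError:
    rejected = True

print(with_prefix, plain_decimal, forced_hex, rejected)
```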

Next, we search for ‘PyOS_strtoul’ and ‘PyOS_strtol’, and find their definitions next to each other in Python\mystrtoul.c:

    /*
    **      strtoul
    **              This is a general purpose routine for converting
    **              an ascii string to an integer in an arbitrary base.
    **              Leading white space is ignored.  If 'base' is zero
    **              it looks for a leading 0b, 0o or 0x to tell which
    **              base.  If these are absent it defaults to 10.
    **              Base must be 0 or between 2 and 36 (inclusive).
    **              If 'ptr' is non-NULL it will contain a pointer to
    **              the end of the scan.
    **              Errors due to bad pointers will probably result in
    **              exceptions - we don't check for them.
    */
    unsigned long
    PyOS_strtoul(const char *str, char **ptr, int base)
    {
        ...
        /* set pointer to point to the last character scanned */
        if (ptr)
            *ptr = (char *)str;

        return result;

    overflowed:
        if (ptr) {
            /* spool through remaining digit characters */
            while (_PyLong_DigitValue[Py_CHARMASK(*str)] < base)
                ++str;
            *ptr = (char *)str;
        }
        errno = ERANGE;
        return (unsigned long)-1;
    }
    ...
    long
    PyOS_strtol(const char *str, char **ptr, int base)
    {
        long result;
        unsigned long uresult;
        char sign;

        while (*str && Py_ISSPACE(Py_CHARMASK(*str)))
            str++;

        sign = *str;
        if (sign == '+' || sign == '-')
            str++;

        uresult = PyOS_strtoul(str, ptr, base);

        if (uresult <= (unsigned long)LONG_MAX) {
            result = (long)uresult;
            if (sign == '-')
                result = -result;
        }
        ...
        else {
            errno = ERANGE;
            result = LONG_MAX;
        }
        return result;
    }

The comment of PyOS_strtoul is really informative. Also, we realize that PyOS_strtoul fails in case it receives a number which is too big (to fit a C unsigned long), but it would still set the received pointer to the last character scanned (i.e. one char after the last digit char).

PyOS_strtol doesn’t have a useful comment, but it is short, so we would go over it quickly.

It starts with a while loop to skip all leading spaces in the number’s string. The first non-space char of the number is stored in sign, and is skipped in case it really is a sign symbol.

Now that it has the number stripped from the sign symbol, it just passes it to PyOS_strtoul, to do the job of converting it into a C unsigned long.

Finally, the converted number (as a C unsigned long) is converted yet again (if possible), this time into a C long, which is returned.

If it isn’t possible (or if the number couldn’t be converted into a C unsigned long in the first place), the global errno is set to ERANGE (which probably means range error), and an error value is returned.
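The behavior described above can be modeled in Python. This is a simplified sketch, not a port: it assumes a 64-bit C long, requires an explicit base between 2 and 36 (the base 0 prefix detection is left out), and returns errno and the end position as part of a tuple instead of using globals and out-pointers. The names `strtoul_sketch` and `strtol_sketch` are hypothetical.

```python
LONG_MAX = 2**63 - 1    # assumption: 64-bit C long
ULONG_MAX = 2**64 - 1
ERANGE = "ERANGE"       # stand-in for the C errno value
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def strtoul_sketch(s, base):
    """Return (value, index of first unparsed char, errno)."""
    i = 0
    value = 0
    while i < len(s) and s[i].lower() in DIGITS[:base]:
        value = value * base + DIGITS.index(s[i].lower())
        i += 1
    if value > ULONG_MAX:
        # Overflow: the remaining digits are still consumed.
        return ULONG_MAX, i, ERANGE
    return value, i, None


def strtol_sketch(s, base):
    """Skip spaces, peel off the sign, then delegate to strtoul_sketch."""
    i = 0
    while i < len(s) and s[i].isspace():
        i += 1
    sign = s[i] if i < len(s) else ""
    if sign in ("+", "-"):
        i += 1
    uvalue, consumed, errno = strtoul_sketch(s[i:], base)
    end = i + consumed
    if errno is None and uvalue <= LONG_MAX:
        return (-uvalue if sign == "-" else uvalue), end, None
    return LONG_MAX, end, ERANGE


print(strtol_sketch("-12", 10))
print(strtol_sketch("2f3", 10))  # stops at the letter 'f'
print(strtol_sketch("2f3", 16))  # consumes the whole string
```

Note how, in base 10, the scan of `2f3` stops after the first character, which is exactly the kind of "unfinished" parse that matters later on.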

Back to parsenumber. This time for real.

First, the address of the number’s last char is calculated and stored in end. Whether this is an imaginary number literal (i.e. whether the last char is the letter ‘j’) is stored in imflag.

Then, if the number’s first char is zero, it is obviously not a negative number. Therefore, PyOS_strtoul is called to try to convert it into a C unsigned long, and the value it returns is cast into a C long.

What happens after the call to PyOS_strtoul is up to the number:

- If the number is too big to fit into a C unsigned long, errno is ERANGE, and x is -1.
- If the number fits into a C unsigned long, but not into a C long, errno is zero, and x is negative. In this case, the following if condition is met, so PyLong_FromString is called, and its return value is returned.
- If the number fits into a C unsigned long as well as into a C long, errno is zero, and x is non-negative.

If the number’s first char is not zero, it might be negative, so PyOS_strtol is called. Luckily, as we have seen earlier, PyOS_strtol keeps it simple. So there are only two options after calling it:

- The number fits into a C long, errno is zero, and x is the number.
- The number does not fit into a C long, errno is ERANGE, and x is some error value.

At this point, PyOS_strtoul must have been called, either explicitly, or implicitly through PyOS_strtol. Anyway, PyOS_strtoul has updated end to point to the char after the last digit it parsed.

Whatever comes next is determined by where end points to:

- PyOS_strtoul has parsed the whole number, and thus end points to the null-terminator. Whatever led us here, if errno is ERANGE, the number didn’t fit into a C long, so PyLong_FromString is called, and its return value is returned. Else, the number did fit into x (a C long), and so PyLong_FromLong is called, and its return value is returned.
- PyOS_strtoul has not finished parsing the number because it ends with the letter ‘j’. In that case, imflag is set. Ultimately, PyComplex_FromCComplex is called, and its return value is returned.
- PyOS_strtoul has not finished parsing the number because it is a number with an exponent part, which begins with the letter ‘e’. This time, imflag is not set, and as a result, PyOS_string_to_double is called to convert the number into a C double. Finally, PyFloat_FromDouble is called, and its return value is returned.
- PyOS_strtoul has not finished parsing the number because it is a fraction, which has a dot somewhere, and the flow is the same as the one in the previous scenario.

There shouldn’t be any more scenarios… unless someone actually patched CPython in some weird way, just as we did 🙂

In our patched CPython, it might be that the number PyOS_strtoul received is a hex integer literal without a prefix, but PyOS_strtoul treats it like a decimal integer literal, and stops parsing it on any of its alphabetical hex digits. In that case, imflag is not set, so just as in the two previous scenarios, PyOS_string_to_double is called. It fails, of course, as we have seen at the end of the last post, and the beginning of this one.
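The whole decision flow, including this failure, can be condensed into a small Python model. It is a rough sketch rather than a translation: `parsenumber_sketch` and its `default_base` parameter are hypothetical, regular expressions stand in for the PyOS_strtoul scan, and the overflow handling is dropped because Python integers do not overflow.

```python
import re


def parsenumber_sketch(s, default_base=10):
    """Rough model of the decision flow; not CPython's actual code."""
    if s[0] == "0":
        # Base 0 behavior: a prefix decides the base.
        m = re.match(r"0[xX][0-9a-fA-F]+|0[oO][0-7]+|0[bB][01]+|0+", s)
    else:
        digits = "0-9" if default_base == 10 else "0-9a-fA-F"
        m = re.match("[%s]+" % digits, s)
    end = m.end() if m else 0

    if end == len(s):
        # The integer scan consumed everything: it is an int.
        return int(s, 0 if s[0] == "0" else default_base)
    if s[-1] in "jJ":
        return complex(s)
    # Anything else is assumed to be a float; this is where 2f3 blows up.
    return float(s)


print(parsenumber_sketch("42"), parsenumber_sketch("1e5"), parsenumber_sketch("2j"))
try:
    parsenumber_sketch("2f3")
except ValueError as exc:
    print("ValueError:", exc)
print(parsenumber_sketch("2f3", default_base=16))
```

With the default base of 10, `2f3` falls through to the float conversion and fails, while letting the integer scan use base 16 makes it consume the whole literal.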

Hmmm…

It seems like we are finally ready for the second patch.

We don’t care about the part of parsenumber that handles numbers starting with zero, nor do we care about the part that handles imaginary numbers or fractions.

We simply want integer literals without any prefix to be treated as hexadecimal numbers.

Again, our patch turns out to be quite simple:

    static PyObject *
    parsenumber(struct compiling *c, const char *s)
    {
        const char *end;
        long x;
        double dx;
        Py_complex compl;
        int imflag;
        ...
        errno = 0;
        end = s + strlen(s) - 1;
        imflag = *end == 'j' || *end == 'J';
        if (s[0] == '0') {
            x = (long) PyOS_strtoul(s, (char **)&end, 0);
            if (x < 0 && errno == 0) {
                return PyLong_FromString(s, (char **)0, 0);
            }
        }
        else
            // origLine: x = PyOS_strtol(s, (char **)&end, 0);
            x = PyOS_strtol(s, (char **)&end, 0x10); // orenmnLine
        if (*end == '\0') {
            if (errno != 0)
                // origLine: return PyLong_FromString(s, (char **)0, 0);
                return PyLong_FromString(s, (char **)0, 0x10); // orenmnLine
            return PyLong_FromLong(x);
        }
        /* XXX Huge floats may silently fail */
        if (imflag) {
            ...
            return PyComplex_FromCComplex(compl);
        }
        else
        {
            dx = PyOS_string_to_double(s, NULL, NULL);
            ...
            return PyFloat_FromDouble(dx);
        }
    }

We build our patched CPython, and somewhat surprisingly, the default base of integer literals indeed seems to be hexadecimal.