In the previous post, we patched CPython to change the base of the representation of an ‘int’ object from decimal to hexadecimal.

In this post we will explore CPython further, in order to add some more patches to our patched CPython version. This time, our goal is to change the default base of integer literals in Python source code from decimal to hexadecimal, i.e. we want the following behavior:

>>> hex(2f)
'0x2f'
>>> 100 - 1
ff

We watched From Source to Code: How CPython’s Compiler Works by Brett Cannon (now that’s a cool name) some time ago, so we kind of know the steps taken by the CPython interpreter to run a piece of Python source code:

1. decoding
2. tokenizing
3. parsing
4. transforming the CST into an AST (in addition to Brett’s talk, you might refer to Eli’s post for more info)
5. compiling
6. executing the bytecode
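
The steps above can be sketched from Python itself, as a rough analogy to what the interpreter does internally (the CST stage is internal to CPython and is not exposed here):

```python
import ast

src = 'x = 3 + 2'
tree = ast.parse(src)                    # tokenizing + parsing (+ CST -> AST)
code = compile(tree, '<demo>', 'exec')   # compiling the AST to bytecode
ns = {}
exec(code, ns)                           # executing the bytecode
print(ns['x'])
```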

Somewhere along these steps, there are two things we must change.

The first is the way the interpreter determines whether a run of characters is a number (or any other syntactically valid construct) or a syntax error. The current behavior:

>>> 2f
  File "<stdin>", line 1
    2f
     ^
SyntaxError: invalid syntax

The second is the way the interpreter converts an integer literal in the source into a Python ‘int’ object. The current behavior:

>>> 100
64
>>> 0x100
100
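
In Python-level terms, our goal amounts to making the compiler treat an undecorated literal roughly like int(literal, 16) instead of int(literal, 10) (this is an analogy; the C code does not literally call the builtin):

```python
print(int('100', 10))  # how a plain literal is read today
print(int('100', 16))  # how we want the same literal to be read
```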

First things first, we will go over the interpreter’s steps and find out which of them are relevant to us.

Decoding is about converting the Python source code bytes from any encoding into the right format for the tokenizer and parser. But what is the right format for them? We google ‘python source encoding’, and find PEP 3120, which says that utf-8 is Python’s default source encoding. Is that because CPython internally stores the characters of a ‘str’ object as a utf-8 encoded string? We google ‘python internal string representation’, and find PEP 0393, which says that for each ‘str’, CPython checks the max number of bytes needed to encode any of the chars in the string. If that max is 1 byte, the internal representation of the characters of the ‘str’ is an array of Py_UCS1 (an 8-bit char type); similarly, if the max is 2 or 4 bytes, it is an array of Py_UCS2 (a 16-bit char type) or Py_UCS4 (a 32-bit char type), respectively. By the way, reading PEP 0393 gives us another important hint: CPython’s equivalent of Python’s ‘str’ is PyUnicode_Type.
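
We can observe the PEP 0393 scheme indirectly from Python: the memory footprint of a ‘str’ grows with the width of its widest character. A small sketch (exact sizes vary between CPython versions, so we only compare):

```python
import sys

ascii_s = 'a' * 100           # every char fits in 1 byte -> Py_UCS1
ucs2_s  = '\u0100' * 100      # needs 2 bytes per char    -> Py_UCS2
ucs4_s  = '\U00010000' * 100  # needs 4 bytes per char    -> Py_UCS4
print(sys.getsizeof(ascii_s) < sys.getsizeof(ucs2_s) < sys.getsizeof(ucs4_s))
```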

Anyway, this internal representation is definitely not utf-8. With our newly acquired knowledge, we might guess that decoding is about converting Python source code bytes (which are assumed to be a utf-8 encoded string unless explicitly specified otherwise) into that cool representation described in PEP 0393.
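
We can at least watch the decoding step in action from Python: compile() accepts source bytes and honors a coding declaration, so a non-utf-8 source gets decoded before anything else happens:

```python
# The byte 0xE9 is 'é' in latin-1; the coding cookie tells the decoder so.
src = b"# -*- coding: latin-1 -*-\nx = '\xe9'\n"
ns = {}
exec(compile(src, '<demo>', 'exec'), ns)
print(ns['x'] == '\u00e9')
```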

Back to business: while decoding, the interpreter doesn’t care about syntax yet, and definitely has nothing to do with converting anything into an ‘int’ object, so we won’t deliberately explore the code that does the decoding (though if we are lucky, we might accidentally find out whether our guess was right).

Tokenizing is about splitting the decoded source into tokens. According to Brett’s aforementioned talk, it seems like tokenizing isn’t just splitting ‘x=3+2’ into [‘x’, ‘=’, ‘3’, ‘+’, ‘2’], but also identifying ‘x’ as a NAME token, ‘3’ and ‘2’ as NUMBER tokens, etc. Could it be that as part of the tokenizing, a literal such as 2f is identified as a syntax error, because it doesn’t look like any valid token?
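
The standard library’s tokenize module (a pure-Python cousin of the C tokenizer, not the code we will be patching) lets us see exactly this kind of output:

```python
import io
import tokenize

# Tokenize 'x = 3 + 2' and print each token's type and text.
for tok in tokenize.tokenize(io.BytesIO(b'x = 3 + 2\n').readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```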

Parsing is about constructing a CST (AKA a parse tree) out of the tokens the tokenizer has produced. Could it be that at this point an integer NUMBER token is converted and stored as an ‘int’ object in a node in the CST?

Transforming the CST into an AST is quite self-explanatory. Could it be that at this point an integer NUMBER node in the CST is converted and stored as an ‘int’ object in a node in the AST?

We can’t believe an integer NUMBER token would be stored in the AST in any form other than an ‘int’ object (Certainly NUMBER tokens don’t suffer from the ‘Pikachu Syndrome’ and refuse to evolve into an ‘int’ object, right?). But could we be certain about it? We google ‘python ast example’, and find yet another useful post by Eli, this time about the ast module.

Now that we know how, we use the ast module to simulate a very simple AST:

>>> import ast
>>> ast.dump(ast.parse('0x10 + 0b111', mode='eval'))
'Expression(body=BinOp(left=Num(n=16), op=Add(), right=Num(n=7)))'

Great! As we thought, integers are not stored in the AST as the strings they once were in the source. Well, we still don’t really know those integers are stored as ‘int’ objects, but we know they are parsed and stored as numbers (instead of strings), and this is enough for our purpose.
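
In fact, we can peek a bit further than the dump. The number node exposes the parsed value directly, and it turns out to be an ‘int’ object (the attribute is ‘n’ on the Num node of this era’s CPython, and ‘value’ on the Constant node of newer versions, so this sketch checks for both):

```python
import ast

node = ast.parse('0x10', mode='eval').body
# Num.n in older CPython, Constant.value in newer CPython.
value = node.value if hasattr(node, 'value') else node.n
print(type(value) is int, value)
```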

Now, how can we find the pieces of code in CPython that do the tokenizing, the parsing, and the transforming of the CST into an AST?

In the previous post we found the C implementation of the builtin ‘hex’ function in Python\bltinmodule.c. This time, we search in bltinmodule.c for ‘eval’, and find builtin_eval_impl. This function must eventually take all of the aforementioned interpreter’s steps, maybe except for decoding (if our guess about the decoding was right). All right, it seems like we just have to carefully follow the execution of builtin_eval_impl, and we will certainly find the functions that do the tokenizing, the parsing, and the transforming of the CST into an AST.

We start looking at builtin_eval_impl. Most of it deals with preparing the locals and globals, but the end of the function looks interesting:

static PyObject *
builtin_eval_impl(PyModuleDef *module, PyObject *source, PyObject *globals,
                  PyObject *locals)
/*[clinic end generated code: output=7284501fb7b4d666 input=11ee718a8640e527]*/
{
    PyObject *result, *source_copy;
    const char *str;
    ...
    if (PyCode_Check(source)) {
        ...
        return PyEval_EvalCode(source, globals, locals);
    }
    ...
    str = source_as_string(source, "eval", "string, bytes or code", &cf,
                           &source_copy);
    ...
    result = PyRun_StringFlags(str, Py_eval_input, globals, locals, &cf);
    ...
    return result;
}

The first time the source parameter is accessed is when it is passed to PyCode_Check. We search for ‘PyCode_Check’, and find out it is a simple macro defined in Include\code.h:

#define PyCode_Check(op) (Py_TYPE(op) == &PyCode_Type)

The macro just tells us if a CPython object’s type is PyCode_Type. This is probably the type of the ‘code’ object, but we search for ‘PyCode_Type’ just in case. Indeed, we find it in Objects\codeobject.c, and it is as we thought:

PyTypeObject PyCode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "code",
    ...
};

Back to builtin_eval_impl, if ‘eval’ received a ‘code’ object as the source parameter, it just calls PyEval_EvalCode. But we are interested in Python source code that contains integer literals, so we assume the source parameter is a ‘str’ object.
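
This shortcut is easy to observe from Python. A ‘code’ object (what PyCode_Check tests for, roughly type(obj) is types.CodeType) skips all the parsing work and goes straight to evaluation:

```python
import types

code = compile('0x10 + 1', '<demo>', 'eval')
print(type(code) is types.CodeType)  # roughly what PyCode_Check tests
print(eval(code))                    # takes the PyEval_EvalCode shortcut
```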

The source ‘str’ object is converted into a C string by source_as_string. Wait a moment… Convert a ‘str’ object into a C string? That means either taking the bytes of its Py_UCS1/Py_UCS2/Py_UCS4 array as is, or converting the ‘str’ into some encoding. We search for ‘source_as_string’, and find it also in bltinmodule.c:

static const char *
source_as_string(PyObject *cmd, const char *funcname, const char *what,
                 PyCompilerFlags *cf, PyObject **cmd_copy)
{
    const char *str;
    ...
    if (PyUnicode_Check(cmd)) {
        ...
        str = PyUnicode_AsUTF8AndSize(cmd, &size);
        ...
    }
    ...
    return str;
}

First, the cmd parameter (the source) is passed to PyUnicode_Check. Earlier, we realized CPython’s equivalent of Python’s ‘str’ is PyUnicode_Type. We google ‘PyUnicode_Type’, and find Unicode Objects and Codecs – Python 3.5.1 documentation. We search there for PyUnicode_Check, and find:

int PyUnicode_Check(PyObject *o)
    Return true if the object o is a Unicode object or an instance of a Unicode subtype.

So if the received source is a ‘str’ object (or an instance of a ‘str’ subclass), it is converted into utf-8, and the utf-8 encoded string is returned. Hmmm… Looks like our guess was wrong. It seems like the right format for the tokenizer and parser is actually a utf-8 encoded string.
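
The “or an instance of a Unicode subtype” part is visible from Python too: ‘eval’ happily accepts a ‘str’ subclass, since PyUnicode_Check passes for it. A quick sketch (MyStr is just a throwaway name for illustration):

```python
class MyStr(str):
    """A 'str' subclass; PyUnicode_Check accepts instances of these too."""

print(eval(MyStr('0x10')))
```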

Back to builtin_eval_impl, the ‘str’ source is converted into a utf-8 encoded string, and passed to PyRun_StringFlags, along with the globals and locals. Which means PyRun_StringFlags still has to do all of the job following the decoding. We search for ‘PyRun_StringFlags’, and find it in Python\pythonrun.c:

PyObject *
PyRun_StringFlags(const char *str, int start, PyObject *globals,
                  PyObject *locals, PyCompilerFlags *flags)
{
    ...
    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod != NULL)
        ret = run_mod(mod, filename, globals, locals, flags, arena);
    ...
    return ret;
}

If we had to guess, we would say PyParser_ASTFromStringObject does everything from decoded source to AST, and run_mod receives the AST and does all the rest.

We search for ‘PyParser_ASTFromStringObject’ and find it too in Python\pythonrun.c:

/* Preferred access to parser is through AST. */
mod_ty
PyParser_ASTFromStringObject(const char *s, PyObject *filename, int start,
                             PyCompilerFlags *flags, PyArena *arena)
{
    ...
    node *n = PyParser_ParseStringObject(s, filename, &_PyParser_Grammar,
                                         start, &err, &iflags);
    ...
    if (n) {
        ...
        mod = PyAST_FromNodeObject(n, flags, filename, arena);
        PyNode_Free(n);
    }
    ...
    return mod;
}

We guess that PyParser_ParseStringObject does everything from decoded source to CST, and that PyAST_FromNodeObject transforms the CST into an AST. We search for ‘PyAST_FromNodeObject’, find it in Python\ast.c, and give it a quick look that confirms our guess:

/* Transform the CST rooted at node * to the appropriate AST */
mod_ty
PyAST_FromNodeObject(const node *n, PyCompilerFlags *flags,
                     PyObject *filename, PyArena *arena)
{
    ...
}

Awesome! Even though diving right into PyAST_FromNodeObject is quite tempting, it would probably be better for us to explore the interpreter’s steps in order. Thus, we will first complete our quest of finding the tokenizer and the parser, and only then come back to PyAST_FromNodeObject.

We search for ‘PyParser_ParseStringObject’, and find it in Parser\parsetok.c:

node *
PyParser_ParseStringObject(const char *s, PyObject *filename,
                           grammar *g, int start, perrdetail *err_ret,
                           int *flags)
{
    struct tok_state *tok;
    ...
    if (*flags & PyPARSE_IGNORE_COOKIE)
        tok = PyTokenizer_FromUTF8(s, exec_input);
    else
        tok = PyTokenizer_FromString(s, exec_input);
    ...
    return parsetok(tok, g, start, err_ret, flags);
}

Seems like we have found the functions that do the tokenizing, right? We search for ‘PyTokenizer_FromUTF8’, and find its definition right next to the definition of PyTokenizer_FromString, in Parser\tokenizer.c:

/* Set up tokenizer for string */
struct tok_state *
PyTokenizer_FromString(const char *str, int exec_input)
{
    struct tok_state *tok = tok_new();
    ...
    str = decode_str(str, exec_input, tok);
    ...
    tok->buf = tok->cur = tok->end = tok->inp = (char*)str;
    return tok;
}

struct tok_state *
PyTokenizer_FromUTF8(const char *str, int exec_input)
{
    struct tok_state *tok = tok_new();
    ...
    tok->input = str = translate_newlines(str, exec_input, tok);
    ...
    tok->str = str;
    ...
    strcpy(tok->encoding, "utf-8");
    ...
    tok->buf = tok->cur = tok->end = tok->inp = (char*)str;
    return tok;
}

We are disappointed to find out that each of these two just sets up a tok_state (i.e. a tokenizer struct), that would only be used later to do the tokenizing.

Also, it seems like our second guess about the decoding (i.e. the right format for the tokenizer and parser is a utf-8 encoded string) was right. In PyTokenizer_FromString the source string isn’t already a utf-8 encoded string, so decode_str is called, and in PyTokenizer_FromUTF8 it is already a utf-8 encoded string, so no decoding is needed.

To be on the safe side, we take a quick look at decode_str (which we also find in Parser\tokenizer.c):

/* Decode a byte string STR for use as the buffer of TOK.
   Look for encoding declarations inside STR, and record them
   inside TOK. */
static const char *
decode_str(const char *input, int single, struct tok_state *tok)
{
    PyObject* utf8 = NULL;
    const char *str;
    ...
    tok->input = str = translate_newlines(input, single, tok);
    ...
    if (tok->enc != NULL) {
        utf8 = translate_into_utf8(str, tok->enc);
        ...
        str = PyBytes_AsString(utf8);
    }
    ...
    tok->decoding_buffer = utf8; /* CAUTION */
    return str;
}

Indeed, decode_str converts a C string into a utf-8 encoded string. This ‘CAUTION’ comment is kind of scary, but we assume it doesn’t really matter to us.
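
The “look for encoding declarations” part of decode_str has a convenient Python-level counterpart, tokenize.detect_encoding, which we can use to sanity-check our understanding (note it reports a normalized encoding name):

```python
import io
import tokenize

# With a coding cookie, the declared encoding is detected (and normalized).
with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"
enc, first_lines = tokenize.detect_encoding(io.BytesIO(with_cookie).readline)
print(enc)

# Without a cookie, the utf-8 default applies.
enc_default, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(enc_default)
```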

Back to PyParser_ParseStringObject, we must conclude that parsetok (also defined in parsetok.c) does the tokenizing and the parsing on its own.

This post has become much longer than I expected, so we will end it with a short recap, and continue our journey in the next post.

We found some things that would help us reach our goal:

- Sometime before the construction of the AST is finished, integer literals are parsed and stored as numbers (specifically, as Python ‘int’ objects, we guess).
- PyAST_FromNodeObject in Python\ast.c transforms the CST into an AST.
- parsetok in Parser\parsetok.c does the tokenizing and the parsing.

We also learned some other interesting things:

- Python’s default source encoding is utf-8, and decoding is actually converting source code encoded in any other encoding into a utf-8 encoded string.
- The internal representation of the characters of a ‘str’ object is a Py_UCS1/Py_UCS2/Py_UCS4 array, chosen according to the max number of bytes needed to encode any of the chars in the string (as described best in PEP 0393).

part 3