During the last year or so, I have made some attempts to become familiar with CPython (from here on, I will refer to the CPython implementation of the awesome Python 3 simply as CPython). I have tried several different tactics.

Along the way, I noticed some smart-looking people saying that the best way to understand CPython is to make some changes to it and then see what happens. Of course, I felt I was capable enough to figure it all out by simply reading stuff. And so it was only a full year after starting my quest to understand CPython that I decided to make some patches to it.

I found the process of searching and messing with the CPython sources while having a well-defined purpose to be a great experience, both fun and quite educational. I have decided to write this walkthrough in the hope of inspiring you to try making your own patches (which will probably be much better than mine).

It will probably be easier to follow the walkthrough while looking at my patched CPython version, but I will try to make things clear even without it.

Let us start with that walkthrough right away.

Our first goal is to change the base of the representation of an ‘int’ object from decimal to hexadecimal, i.e.:

    >>> 31
    1f
    >>> 32
    20
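Before touching any C code, the target behavior can be mimicked in pure Python. The following is only a sketch of the goal, not the patch itself: `HexInt` is a hypothetical subclass invented for illustration, and it affects only its own instances, whereas the patch changes every ‘int’.

```python
class HexInt(int):
    """Hypothetical int subclass whose repr/str are hex digits without '0x'."""

    def __repr__(self):
        # format(..., "x") yields lowercase hex digits with no prefix.
        return format(int(self), "x")

    # Use the same function for str(), mirroring what we want from 'int'.
    __str__ = __repr__


print(repr(HexInt(31)))  # 1f
print(str(HexInt(32)))   # 20
```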

We know the ‘int’ type is implemented in C, so we have to find the C equivalents of ‘__repr__’ and ‘__str__’ in its type declaration. There must be a file in the ‘Objects’ directory for the ‘int’ type. Ah? No such file? We can easily spot floatobject.c, dictobject.c and listobject.c, but where is intobject.c? Let’s try another approach: search every *.c and *.h file in the code base for the string "int" (double quotes included). Two results look quite interesting:
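Any grep-like tool or IDE search will do for this step. For completeness, here is a minimal Python sketch of such a search; the function name `find_in_sources` and its parameters are made up for this example, and the root path you pass in is assumed to be your own CPython checkout.

```python
from pathlib import Path


def find_in_sources(root, needle, suffixes=(".c", ".h")):
    """Yield (path, line_number, line) for each source line containing needle."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Replace undecodable bytes instead of failing on an odd file.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()
```

Calling `find_in_sources(checkout_dir, '"int"')` and printing each result gives a list of candidate files and line numbers to look through.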

In Python\bltinmodule.c:

    SETBUILTIN("int", &PyLong_Type);

and in Objects\longobject.c:

    PyTypeObject PyLong_Type = {
        PyVarObject_HEAD_INIT(&PyType_Type, 0)
        "int",                                      /* tp_name */

Seems like the ‘int’ type is referred to as PyLong_Type in CPython. To make sure, we google ‘int long in python 3’. One of the first results is Integer Objects – Python 3.5.1 documentation, which says very clearly:

    PyTypeObject PyLong_Type
        This instance of PyTypeObject represents the Python integer type. This is the same object as int in the Python layer.

This site looks really useful. We had better remember it.

Anyway, seems like we found the initialization of the ‘int’ PyTypeObject in Objects\longobject.c:

    PyTypeObject PyLong_Type = {
        PyVarObject_HEAD_INIT(&PyType_Type, 0)
        "int",                                      /* tp_name */
        ...
        long_to_decimal_string,                     /* tp_repr */
        ...
        long_to_decimal_string,                     /* tp_str */
        ...
    };
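At the Python level, these two slots are what `repr()` and `str()` end up calling for an ‘int’. A quick check on an unpatched interpreter (a small illustration, not something taken from the CPython sources) shows both producing the same decimal text:

```python
# repr() and str() of an int go through the type's tp_repr and tp_str slots,
# which are reachable from Python as int.__repr__ and int.__str__.
print(int.__repr__(31))     # 31
print(int.__str__(31))      # 31
print(repr(31) == str(31))  # True on an unpatched interpreter
```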

Great! Seems like replacing tp_repr and tp_str with our own function would do the job. We search for ‘long_to_decimal_string’ to see what our function should look like. We find it in Objects\longobject.c:

    static PyObject *
    long_to_decimal_string(PyObject *aa)
    {
        PyObject *v;
        if (long_to_decimal_string_internal(aa, &v, NULL) == -1)
            return NULL;
        return v;
    }

long_to_decimal_string receives an ‘int’ and returns a ‘str’, as expected. So we have to find a function similar to the builtin ‘hex’ function, but one that won’t add the ‘0x’ prefix. We search for the string "hex" (double quotes included), and find in Python\clinic\bltinmodule.c.h:

    #define BUILTIN_HEX_METHODDEF    \
        {"hex", (PyCFunction)builtin_hex, METH_O, builtin_hex__doc__},

All right. We search for ‘builtin_hex’, and find in Python\bltinmodule.c:

    static PyObject *
    builtin_hex(PyModuleDef *module, PyObject *number)
    /*[clinic end generated code: output=618489ce3cbc5858 input=e645aff5fc7d540e]*/
    {
        return PyNumber_ToBase(number, 16);
    }

Ah? What’s that ‘clinic’ thing? We google ‘python clinic’ and find PEP 436 right away. It describes Argument Clinic, a code generator that makes the CPython developers’ lives easier. Nothing we should worry about. Anyway, builtin_hex is just a wrapper around PyNumber_ToBase, which we search for and find in Objects\abstract.c:

    PyObject *
    PyNumber_ToBase(PyObject *n, int base)
    {
        PyObject *res = NULL;
        PyObject *index = PyNumber_Index(n);

        if (!index)
            return NULL;
        if (PyLong_Check(index))
            res = _PyLong_Format(index, base);
        else
            /* It should not be possible to get here, as
               PyNumber_Index already has a check for the same
               condition */
            PyErr_SetString(PyExc_ValueError,
                            "PyNumber_ToBase: index not int");
        Py_DECREF(index);
        return res;
    }

A little more happens here, but most of the work is delegated to _PyLong_Format, which we find in Objects\longobject.c:
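The PyNumber_Index call is why `hex` accepts more than plain ints: any object that defines `__index__` is first converted to an ‘int’. A small sketch of that behavior, where `Sixteen` is a hypothetical class made up for this example:

```python
class Sixteen:
    """Hypothetical non-int class that can act as an integer index."""

    def __index__(self):
        return 16


print(hex(Sixteen()))  # 0x10
print(bin(Sixteen()))  # 0b10000

try:
    hex(1.5)  # float has no __index__, so the conversion fails
except TypeError as exc:
    print("TypeError:", exc)
```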

    PyObject *
    _PyLong_Format(PyObject *obj, int base)
    {
        PyObject *str;
        int err;
        if (base == 10)
            err = long_to_decimal_string_internal(obj, &str, NULL);
        else
            err = long_format_binary(obj, base, 1, &str, NULL);
        if (err == -1)
            return NULL;
        return str;
    }

In our case, base is 16, so long_format_binary is the relevant function, which we also find in Objects\longobject.c:

    /* Convert an int object to a string, using a given conversion base,
       which should be one of 2, 8 or 16.  Return a string object.
       If base is 2, 8 or 16, add the proper prefix '0b', '0o' or '0x'
       if alternate is nonzero. */

    static int
    long_format_binary(PyObject *aa, int base, int alternate,
                       PyObject **p_output, _PyUnicodeWriter *writer)
    {
        ...

‘add the proper prefix … if alternate is nonzero’?!?

Perfect!
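To make the role of ‘alternate’ concrete, here is a toy Python model of the behavior that comment describes. It is only a sketch: `format_binary_model` is a name invented for this example, and the real C function works on CPython's internal digit array rather than on Python ints.

```python
def format_binary_model(value, base, alternate):
    """Toy model of the documented behavior of long_format_binary.

    Only bases 2, 8 and 16 are supported, as in the C comment.
    """
    bits = {2: 1, 8: 3, 16: 4}[base]             # bits consumed per output digit
    prefix = {2: "0b", 8: "0o", 16: "0x"}[base]
    sign = "-" if value < 0 else ""
    magnitude = abs(value)

    digits = []
    while True:
        # The low `bits` bits select the next (least significant) digit.
        digits.append("0123456789abcdef"[magnitude & (base - 1)])
        magnitude >>= bits
        if magnitude == 0:
            break

    body = "".join(reversed(digits))
    # The prefix is added only when alternate is nonzero.
    return sign + (prefix if alternate else "") + body


print(format_binary_model(31, 16, 0))   # 1f
print(format_binary_model(31, 16, 1))   # 0x1f
print(format_binary_model(-32, 16, 0))  # -20
```

With alternate set to 0, the output matches what `format(n, "x")` gives at the Python level, which is exactly the text we want ‘int’ to show.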

Looks like we know enough to write our own tp_repr and tp_str for ‘int’:

    static PyObject *
    orenmn_long_to_hex_string(PyObject *longObjPtr)
    {
        PyObject *hexStrReprPtr;
        if (-1 == long_format_binary(
                longObjPtr,
                0x10,
                0,  // alternate = 0 to exclude the '0x' prefix
                &hexStrReprPtr,
                NULL))
        {
            return NULL;
        }
        return hexStrReprPtr;
    }

We put it in Objects\longobject.c, just like long_to_decimal_string. Now we just replace the tp_repr and tp_str:

    PyTypeObject PyLong_Type = {
        PyVarObject_HEAD_INIT(&PyType_Type, 0)
        "int",                                      /* tp_name */
        ...
        // origLine: long_to_decimal_string,        /* tp_repr */
        orenmn_long_to_hex_string,                  /* tp_repr */
        ...
        // origLine: long_to_decimal_string,        /* tp_str */
        orenmn_long_to_hex_string,                  /* tp_str */

And that’s it! We build our patched CPython, and now the representation of ‘int’ is really in hex 🙂
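If you ever lose track of which interpreter you are running, a tiny check like the following can tell a stock build from one patched as described here. It is just a sketch; `int_repr_base` is a hypothetical helper, and it returns words rather than numbers because on the patched build printing 16 would itself show up as 10.

```python
def int_repr_base():
    """Guess which base the running interpreter uses for repr(int)."""
    text = repr(31)
    if text == "31":
        return "decimal"      # stock CPython
    if text == "1f":
        return "hexadecimal"  # behavior expected from the patch above
    return "unknown"          # some other modification


print(int_repr_base())
```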

part 2