This article describes how Python string interning works in CPython 2.7.7.

A few days ago, I had to explain to a colleague what the built-in function intern does. I gave him the following example:
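A minimal sketch of such an example follows. The strings are built at runtime so that the compiler cannot share a single literal between the two variables. `intern_` is a helper alias introduced here (not part of Python) so that the snippet also runs on Python 3, where the function lives at `sys.intern`.

```python
import sys

# `intern` is a built-in in Python 2; Python 3 moved it to `sys.intern`.
# `intern_` is a helper alias introduced for this sketch.
try:
    intern_ = intern
except NameError:
    intern_ = sys.intern

# Build the strings at runtime so the compiler cannot share the literal.
s1 = ''.join(['foo', '!'])
s2 = ''.join(['foo', '!'])
print(s1 == s2)   # True: same contents
print(s1 is s2)   # False on CPython: two distinct objects

s1 = intern_(s1)
s2 = intern_(s2)
print(s1 is s2)   # True: both names now refer to one object
```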

You get the idea, but how does it work internally?

Let’s delve into the CPython source code and take a look at PyStringObject, the C structure representing Python strings, defined in the file stringobject.h:

According to this comment, the variable ob_sstate is different from 0 if and only if the string is interned. This variable is never accessed directly but always through the macro PyString_CHECK_INTERNED defined a few lines below:

Then, let’s open stringobject.c. Line 24 declares a reference to an object where interned strings will be stored:

In fact, this object is a regular Python dictionary, and it is initialized on line 4745:

Finally, all the magic happens on line 4732, in the PyString_InternInPlace function. The implementation is straightforward:

As you can see, the keys in the interned dictionary are pointers to string objects, and the values are the same pointers. Furthermore, instances of string subclasses cannot be interned. Let me set aside error checking and reference counting and rewrite this function in pseudo-Python code:
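The logic can be sketched in runnable Python roughly as follows. This is an illustration under assumptions, not the CPython implementation: `_interned` and `intern_sketch` are hypothetical names, and the `ob_sstate` flag is left out because a Python-level `str` cannot carry extra attributes.

```python
# Hypothetical stand-in for the C-level `interned` dictionary.
_interned = {}


def intern_sketch(string):
    # Only exact `str` instances qualify; subclasses are rejected.
    if type(string) is not str:
        raise TypeError("can't intern subclass of string")
    # If an equal string is already stored, return that canonical object;
    # otherwise store this one as both key and value and return it.
    return _interned.setdefault(string, string)


a = ''.join(['spam', '!'])
b = ''.join(['spam', '!'])
print(intern_sketch(a) is a)  # True: the first string becomes canonical
print(intern_sketch(b) is a)  # True: the equal string maps to the same object
```

The dictionary lookup relies on the hash and equality of the string, which is why two distinct but equal objects end up mapped to a single canonical one.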

Simple, isn’t it?

Why would you intern strings? Firstly, “sharing” string objects reduces the amount of memory used. Let’s go back to our first example. Initially, the variables s1 and s2 reference two different objects:

After being interned, they both point to the same object. The memory occupied by the second object is saved:

When dealing with large lists with low entropy, interning makes sense. For instance, when tokenizing a corpus, the very heavy-tailed distribution of word frequencies in human languages means that a small number of distinct words accounts for most of the tokens, so interning works to our advantage. In the following example, we will load the play Hamlet by Shakespeare with NLTK and use Heapy to inspect the object heap before and after interning:
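The same effect can be observed on a much smaller scale with the standard library alone. The sketch below is not the NLTK/Heapy measurement: it uses a made-up repetitive text and simply counts distinct string objects by `id()`. `intern_` is again a helper alias introduced for illustration.

```python
import sys

# `intern_` is a helper alias: built-in `intern` on Python 2, `sys.intern` on Python 3.
try:
    intern_ = intern
except NameError:
    intern_ = sys.intern

# A made-up, highly repetitive "corpus" (an assumption of this sketch).
text = "to be or not to be that is the question " * 100

# str.split() allocates a fresh string object for every multi-character token.
tokens = text.split()
distinct_before = len(set(id(t) for t in tokens))

# After interning, equal tokens share a single object.
tokens = [intern_(t) for t in tokens]
distinct_after = len(set(id(t) for t in tokens))

print(distinct_before, distinct_after)
```

After interning, the number of distinct objects equals the number of distinct words, however many tokens the list holds.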

As you can see, we drastically reduced the number of allocated string objects, from 31 166 to 4 529, and cut the memory occupied by the strings by a factor of 6.5!

Secondly, interned strings can be compared with an O(1) pointer comparison instead of an O(n) byte-by-byte comparison.

To check this, I measured the time required to test two strings for equality as a function of their length, both when they are interned and when they are not. The following should convince you:
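A rough way to reproduce this kind of measurement for a single length is sketched below, using `timeit`. The length, the repetition count and the helper names (`intern_`, `make`) are arbitrary choices made for this sketch, and the absolute timings depend on the machine.

```python
import sys
import timeit

# `intern_` is a helper alias: built-in `intern` on Python 2, `sys.intern` on Python 3.
try:
    intern_ = intern
except NameError:
    intern_ = sys.intern


def make(n):
    # Build a new string object of length n at runtime.
    return ''.join(['a'] * n)


n = 1000000
a, b = make(n), make(n)            # equal contents, distinct objects
ia, ib = intern_(a), intern_(b)    # both names now refer to one object

# a == b compares two distinct objects: the contents must be scanned.
t_plain = timeit.timeit(lambda: a == b, number=200)
# ia == ib compares an object with itself: identity is enough.
t_interned = timeit.timeit(lambda: ia == ib, number=200)

print(t_plain, t_interned)
```

Repeating the measurement for several values of `n` should show the first timing growing with the length while the second stays flat.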

Under certain conditions, strings are natively interned. Recall the first example: if I had written foo instead of foo!, the strings s1 and s2 would have been interned “automatically”:
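One way to observe this outside of an interactive session is sketched below. It relies on `eval`, an assumption of this sketch rather than part of the original example: each call compiles its source separately, so the compiler cannot merge the two literals.

```python
# Each eval() call compiles its source on its own, so the two literals
# are never merged into a single constant by the compiler.
s1 = eval("'foo'")
s2 = eval("'foo'")
print(s1 is s2)   # True on CPython: this literal is interned automatically

t1 = eval("'foo!'")
t2 = eval("'foo!'")
print(t1 is t2)   # False on CPython 2.7; may vary with the interpreter version
```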

Before writing this blog post, I always thought that, under the hood, strings were natively interned according to a rule based on their length and the characters they contain. I was not far from the truth but, unfortunately, when playing with pairs of strings built in very different ways, I could never infer exactly what this rule was. Can you?