Of bytes and encoded strings

String handling is quite different between Python 2 and Python 3, largely because the latter is designed to handle Unicode natively. But those differences make it more difficult for Python-2-based projects to make the move to Python 3. A proposal to add a feature to the language that would make that transition easier has led to a series of massive threads on the python-dev mailing list. A seemingly simple idea has interesting ramifications, both technical and philosophical.

A bit of an introduction to Python string types is probably in order. Both Python 2 and Python 3 have two types that hold string-like data, but they are quite different. Python 2 has str and unicode. Its str contains 8-bit data that could be binary data or encoded text data in some unknown encoding. Typically, it is hoped that the unknown encoding will be ASCII compatible (i.e. shares the same mappings as 7-bit ASCII), but that is not guaranteed. On the other hand, unicode strings (e.g. u'foo') contain a series of Unicode code points.

For Python 3, though, the str type contains a sequence of Unicode code points, while the bytes type (e.g. b'foo') contains a sequence of integer values in the range 0–255, without any further interpretation of those values. The language removes the ambiguous string type in favor of clear boundaries between Unicode strings and binary data (which might, in fact, be encoded text). Also, unlike Python 2, the two string types are not interoperable. Adding a Unicode str to bytes (str + bytes) is not defined and will raise an exception. So, in order to get external data (from a file or socket, say) into a Python 3 str, it must be decoded from a known encoding (e.g. "ascii", "utf-8", etc.), which is quite different from the Python 2 behavior. This is all covered in much more detail in various places including Nick Coghlan's Python 3 Q&A document, this section in particular.
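As a small illustration of that boundary, the following Python 3 sketch decodes some bytes with a known encoding and then shows that mixing the two types fails. The sample text and variable names are made up for the example.

```python
# Bytes as they might arrive from a file or socket (here, UTF-8 encoded text).
raw = 'caf\u00e9'.encode('utf-8')

# bytes is a sequence of integers in the range 0-255.
first_three = list(raw[:3])

# Turning the data into str requires naming the encoding explicitly.
text = raw.decode('utf-8')

# Mixing the two types is not defined and raises TypeError.
try:
    text + raw
    mixing_failed = False
except TypeError:
    mixing_failed = True

print(first_three, len(raw), len(text), mixing_failed)
```

The encoded form is one byte longer than the decoded text because the accented character takes two bytes in UTF-8.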

Various projects that are looking into moving to Python 3 have stumbled over a particular problem: formatting (or interpolating) values into bytes arrays. The following is illegal in Python 3 today:

b'%d' % 42

One might expect that to create the bytes string b'42', but instead it gives a TypeError exception. Similarly, both of the following give TypeError exceptions:

b'%s' % 'foo'
b'%s' % b'foo'

The former would require an implicit encoding of the Unicode str into bytes, which is exactly the kind of "magic" that Python 3 is trying to avoid. The failure of the latter is perhaps the most surprising, but there is no formatting operator ("%") defined for bytes.

PEP 460

Both the Mercurial and Twisted projects have requested support for certain kinds of bytes formatting to help in their Python 3 porting efforts. That is what led to an RFC post from Victor Stinner that contained a draft of Python Enhancement Proposal (PEP) 460. The proposal would add string-like interpolation to the Python 3 bytes type. Both the "%" operator and the format() method would be supported for a wide range of formatting characters. In Stinner's original proposal, "%s" would only be supported for numeric or bytes arguments. It was targeted as a change for Python 3.5, which is due sometime late in 2015.

Multiple people commented in the thread, both for and against the idea. For example, Mark Shannon thinks that adding formatting for bytes strings, particularly when converting things like numbers to their ASCII representation, violates "the separation of str and bytes in the first place". On the other hand, Marc-Andre Lemburg agrees with Stinner's proposal as it makes things much easier in "situations where you have to work on data which is encoded text using multiple (sometimes even unknown) encodings in a single data chunk. Think MIME messages, mbox files, diffs, etc."

With Stinner's approval, Antoine Pitrou reworked PEP 460 to restrict the feature to a small subset of the formatting codes ("%s" for bytes and "%c" for single integers in the 0–255 range). The rework also added support for the bytearray type (mutable bytes). It got an immediate +1 from Coghlan, who was "initially dubious about the idea". Others were not happy to see the removal of numeric formatting. As Ethan Furman pointed out, allowing ASCII (in the form of numeric formatting) to be interpolated into the bytes type does not mean that the type is ASCII compatible:
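To make the restricted subset concrete, here is a rough pure-Python emulation of what such interpolation might accept. The helper restricted_format() is hypothetical and is not taken from the PEP text; it only illustrates "%s" for bytes or bytearray and "%c" for a single integer in the 0–255 range, plus "%%" for a literal percent sign (an assumption borrowed from ordinary string formatting).

```python
import re

# Matches a percent sign followed by any single byte (the format code).
_SPEC = re.compile(rb'%(.)', re.DOTALL)


def restricted_format(template, *args):
    """Hypothetical helper: interpolate only %s (bytes) and %c (int 0-255)."""
    remaining = list(args)

    def substitute(match):
        code = match.group(1)
        if code == b'%':
            return b'%'  # assumption: %% yields a literal percent sign
        if not remaining:
            raise TypeError('not enough arguments for format')
        value = remaining.pop(0)
        if code == b's':
            if not isinstance(value, (bytes, bytearray)):
                raise TypeError('%s requires bytes or bytearray')
            return bytes(value)
        if code == b'c':
            if not isinstance(value, int) or not 0 <= value <= 255:
                raise TypeError('%c requires an integer in range(256)')
            return bytes([value])
        raise ValueError('unsupported format code')

    result = _SPEC.sub(substitute, template)
    if remaining:
        raise TypeError('not all arguments converted')
    return result


print(restricted_format(b'%s=%c', b'key', 65))
```

Under this sketch, numeric codes such as "%d" are rejected outright, which is the restriction that some participants objected to.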

No, it implies that portion of the byte stream is ASCII compatible. And we have several examples: PDF, HTML, DBF, just about every network protocol (not counting M$), and, I'm sure, plenty I haven't heard of.

At some level, it comes down to a question of "language purity" and how far to push the separation between str and bytes. In a fairly long message, Stephen Hansen described what currently needs to be done for HTTP's Content-Length header:

It seems those who are declaring a purity priority in bytes/string separation think it reasonable to do things like:

headers.append((b"Content-Length": ("%d" % (len(content))).encode("ascii")))

Or something. In the middle of processing a stream, you need to convert this number into a string then encode it into bytes to just represent the number as the extremely common, widely-accessible 7-bit ascii subset of its numerical value. This isn't some rare, grandiose or fiendish undertaking, or trying to merge Strings and Bytes back together: this is the simple practical recognition that representing a number as its ascii-numerical value is actually not at all uncommon.

This position seems utterly astonishing in its ridiculousness to me. The recognition that the number "123" may be represented as b"123" surprises me as a controversial thing, considering how often I see it in real life.
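The snippet in the quote is not valid Python as written (a colon appears inside the tuple, hence the "Or something"). A runnable version of the same workaround might look like the following, where content and headers are placeholder names for the example:

```python
# Placeholder payload; in a real server this would be the response body.
content = b'hello, world'

headers = []

# Format the length as str first, then encode the digits as ASCII bytes.
length_field = ('%d' % len(content)).encode('ascii')
headers.append((b'Content-Length', length_field))

print(headers)
```

The number has to pass through str and an explicit encode() step before it can sit next to the other bytes in the header, which is the round trip the quote complains about.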

But Coghlan is tired of seeing repeated arguments that ultimately lead to what he sees as a dangerous blurring of the line between str and bytes. One possible solution that he sees is in the nascent asciicompat project, which would add a new type (first as an extension, then perhaps to the language core) that had some of the properties being advocated in the thread.

PEP 461

The discussion in that thread (and others) got more heated than most in python-dev. Eventually, Furman forked the PEP into PEP 461, adding back the numeric formatting. Brett Cannon characterized the split as one that is between people intent on "explicit is better than implicit" (PEP 460) versus those who are focused on "practicality beats purity" (PEP 461). Both of those statements come from The Zen of Python, so it comes down to a question of the priority each person puts on them.

As it turns out, though, Python benevolent dictator for life (BDFL) Guido van Rossum is clearly in the "practicality beats purity" camp. He noted that there already is some inherent ASCII bias in the bytes type. Others had earlier pointed out some similar examples, but Van Rossum succinctly summarized the case:

But this does not mean the bytes type isn't allowed to have a noticeable bias in favor of encodings that are ASCII supersets, even if not all bytes objects contain such data (e.g. image data, compressed data, binary network packets, and so on).

IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and also for b'{}'.format(42) to return b'42'. There are numerous places where bytes are already assumed to use an ASCII superset:

byte literals: b'abc' (it's a syntax error to have a non-ASCII character here)

the upper() and lower() methods modify the ASCII letter positions

int(b'42') == 42, float(b'3.14') == 3.14
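Those existing ASCII assumptions are easy to check in a Python 3 interpreter. The following sketch exercises each of the three examples from the quote; the variable names are only for illustration.

```python
# upper() and lower() only touch the ASCII letter positions;
# the non-ASCII byte 0xE9 is left alone.
shouted = b'\xe9abc'.upper()

# int() and float() accept bytes containing ASCII digits.
parsed_int = int(b'42')
parsed_float = float(b'3.14')

# A bytes literal containing a non-ASCII character is a syntax error.
try:
    compile("b'\u00e9'", '<example>', 'eval')
    literal_rejected = False
except SyntaxError:
    literal_rejected = True

print(shouted, parsed_int, parsed_float, literal_rejected)
```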



That summary seemed to cut through a lot of the argument (Van Rossum's BDFL role plays a part in that, of course). The only real sticking point seems to be what should be returned for constructs like:

b'%s' % 'x'

Van Rossum thinks it should work as follows:

b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' enclosed in single quotes)

Not everyone is happy with that, but it is in keeping with an existing behavior, he said:

It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-)
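The existing behavior that Van Rossum refers to can be seen directly, and the result he proposed can be emulated by hand. In this sketch the variable "proposed" is only an emulation built with repr() and encode(); it does not call any bytes formatting.

```python
# Existing behavior: interpolating bytes into a str uses the repr of the bytes.
existing = '%s' % b'x'

# Emulation of the proposed mirror image: the repr of the str, as ASCII bytes.
proposed = repr('x').encode('ascii')

print(existing, proposed, len(proposed))
```

Either direction silently embeds quote characters in the output, which is the sort of surprise that critics of the symmetry worried about.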

While that symmetry had some appeal, it was seen as unnecessary and potentially a source of hard-to-track-down bugs. So, after another lengthy discussion, most seemed to settle for a restricted version of the "%s" specifier: it will only accept arguments that provide the buffer protocol or have a __bytes__() method; otherwise, a TypeError will be raised. In practice, that will restrict %s to some internal CPython array types and to bytes; str and numeric types do not have __bytes__() methods.
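The rule can be approximated in plain Python. The function accepted_by_percent_s() and the Payload class below are hypothetical names for this sketch; the check uses memoryview() as a stand-in test for buffer protocol support and looks for __bytes__() on the argument's type.

```python
class Payload:
    """Hypothetical user type that opts in by defining __bytes__()."""

    def __init__(self, data):
        self.data = data

    def __bytes__(self):
        return self.data


def accepted_by_percent_s(value):
    """Sketch of the rule: buffer protocol support or a __bytes__() method."""
    if hasattr(type(value), '__bytes__'):
        return True
    try:
        memoryview(value)  # succeeds only for objects exposing a buffer
    except TypeError:
        return False
    return True


for candidate in (b'x', bytearray(b'x'), Payload(b'x'), 'x', 42):
    print(type(candidate).__name__, accepted_by_percent_s(candidate))
```

With this check, bytes, bytearray, and the opt-in Payload type pass, while str and int are rejected, matching the outcome described above.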