First, have a decent regression- and functional-test suite for your program. If you don’t have one, write one now. This may sound like a step you can skip, but if you do…count on it that the Dread God Finagle and his mad prophet Murphy will cause you much more pain later in the process than you think you’re avoiding now.

The first step is to run through the porting-issues checklist in the previous section. Doing these changes while the code is still running under Python 2 should not cause any issues under Python 3, but to be extra sure you should finish this step by temporarily changing your shebang line to #!/usr/bin/env python3 and running your tests.

This procedure assumes you are starting with your shebang line as #!/usr/bin/env python2 so you nail down what version you are testing with.

The reason you wanted this as a separate step is so that you can do the next bit - actually making it run under 3 - with the serious string-vs.-unicode issues separated from the syntax tweaks and import munging that 2to3 does for you.

At the end of this step, you should have a kind of amphibian - a working 2.7 program, passing your regression tests, that does Python 3 imports when run under 3 and does binary (implicitly byte-buffer) I/O. However, this amphibian probably will not run correctly under Python 3.

Your objective at this stage is not yet to move fully to 3, so it’s possible you might need to back out some 2to3 patch bands that make incompatible changes relating to strings and unicode. Save these; you’ll want them for a later stage.

Warning: In some Python 3 versions getstatusoutput() returns status incorrectly so that a nonzero exit looks like the subprocess was signaled! (Observed under 3.4.3; Debian bug #764848) It is likely this will not affect your program unless you are trying to distinguish between these cases.

Sometimes you need to be a bit more fine-grained. For example, in Python 2 getstatusoutput() lives in the commands library module; in Python 3 there is no commands module and the function moves to subprocess. You can get around this by writing
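A shim along these lines does the job; this sketch imports under the Python 3 location first, so the call sites look the same in both versions:

```python
# getstatusoutput() moved: subprocess in Python 3, commands in Python 2.
# Try the Python 3 home first, fall back to the Python 2 one.
try:
    from subprocess import getstatusoutput  # Python 3
except ImportError:
    from commands import getstatusoutput    # Python 2
```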

map and zip are builtins in Python 2 as well as Python 3, but in Python 2 they return lists, not generators. So you have to use the itertools spelling to get the generator behavior in both versions.
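One common spelling of this (the try/except guard is my choice of mechanism; a conditional on sys.version_info works just as well):

```python
# Under Python 2, rebind map and zip to the lazy itertools versions
# so that both versions see iterators, not lists.
try:
    from itertools import imap as map, izip as zip  # Python 2
except ImportError:
    pass  # Python 3: the builtins already return iterators
```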

The xrange builtin exists in Python 2 and has the same behavior as the range builtin in Python 3; but you can’t use the range spelling because range also exists as a builtin in Python 2, but it returns a list, not an opaque, immutable sequence. A similar strategy can be used with generators:
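The rebinding trick looks like this; the same pattern extends to any other name where Python 2 has a separate lazy, generator-style spelling:

```python
# Bind the Python 3 spelling to the lazy Python 2 builtin; under
# Python 3 the NameError branch leaves the builtin range alone.
try:
    range = xrange   # Python 2: xrange is the lazy version
except NameError:
    pass             # Python 3: range is already lazy
```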

It’s good practice to make names look like Python 3’s where possible, as in the above example; this will minimize code churn if you ever decide to leave Python 2 support behind. However, in some cases, the spelling you have to use to keep your code running on both Python 2 and Python 3 will look like the Python 2 spelling instead of the Python 3 one. For example:
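One concrete case (an illustration of mine, not named in the text): the u"..." literal prefix is the Python 2 spelling for Unicode strings, and Python 3.3 re-allowed it precisely to ease porting, so dual-version code keeps the Python 2 spelling:

```python
# The u prefix is required under Python 2 to get a unicode object,
# and is accepted (though redundant) under Python 3.3 and later.
greeting = u"hëllo"
```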

A similar piece of magic autoadapts to the name change between Python 2 raw_input and Python 3 input. The only difference here is that Python 2 also defines input as a builtin, so to avoid colliding with it, you have to pick your own name that will point to the right function in both Python versions, and write, for example
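Something like the following; my_input is a placeholder name, pick whatever fits your program:

```python
# Python 2's input() evaluates what the user types, which is not what
# we want; raw_input() is the safe one, renamed input() in Python 3.
try:
    my_input = raw_input  # Python 2
except NameError:
    my_input = input      # Python 3
```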

Apply a similar pattern to all the other simple library name changes; try to import the Python 3 version, and if that fails substitute in the Python 2 version.

For example, if your program uses the Python 2 ConfigParser library, 2to3 is going to change this to the Python 3 name configparser. What you need to do is yank that out and add this code snippet just after your general imports:
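The snippet takes this shape, giving you the Python 3 name in both versions:

```python
# Import the config-parser module under its Python 3 name on both versions.
try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2
```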

Run 2to3 on your program, apply the patch it generates, and (this is important) partially revert what it does so the result runs correctly under Python 2.

In Python 3, the next method of iterators is renamed to __next__, and 2to3 does this renaming. For compatibility with Python 2, you should add a method alias
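A minimal sketch of an iterator class carrying the alias (Counter is an illustrative name):

```python
class Counter:
    "Iterator that counts from 1 to limit, usable under Python 2 and 3."
    def __init__(self, limit):
        self.count = 0
        self.limit = limit
    def __iter__(self):
        return self
    def __next__(self):            # Python 3 iterator protocol
        if self.count >= self.limit:
            raise StopIteration
        self.count += 1
        return self.count
    next = __next__                # Python 2 compatibility alias
```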

pies is an alternative to six you may want to consider. See the pies README on GitHub for the details.

It also allows you to use metaclasses both in Python 2 and Python 3, even if the syntax between the two is completely different.

Fix up string/unicode mixing

Now it’s time to tweak your shebang line to #!/usr/bin/env python3 and make that work.

This is going to consist mostly of adding encode() and decode() calls to change data between string and unicode types. This is the heavy lifting in your Python 3 port. Because of the ASCII-compatible, 0x80..0xff-preserving encoding you’ve chosen, these will be no-ops under Python 2.

The art here is in doing as little work as possible. Your encode() and decode() calls should intercept your binary I/O close to where it happens, so the bulk of your code is just seeing Unicode strings.
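A sketch of the shape this takes (the helper names and the latin-1 choice are illustrative, and io.BytesIO stands in for a real binary file):

```python
import io

binary_encoding = "latin-1"

def read_text(stream):
    "Decode at the I/O boundary; everything inside sees Unicode."
    return stream.read().decode(binary_encoding)

def write_text(stream, text):
    "Encode at the I/O boundary on the way back out."
    stream.write(text.encode(binary_encoding))

# Interior code works purely on Unicode strings:
source = io.BytesIO(b"caf\xe9\n")       # stands in for a file in "rb" mode
text = read_text(source).upper()
sink = io.BytesIO()                     # stands in for a file in "wb" mode
write_text(sink, text)
```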

This is also the stage at which you may need to tag some literals with a b prefix for byte-buffer. Beware: if you have a lot of these, it may mean you have not put your encode/decode calls close enough to the natural choke points where your binary I/O is happening.
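For example (a PNG signature check is my illustration, not from the text), data compared against binary input should stay a byte buffer:

```python
import io

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # b prefix keeps this a byte buffer on both versions

# io.BytesIO stands in for a real file opened in "rb" mode.
stream = io.BytesIO(PNG_MAGIC + b"rest of file...")
header = stream.read(8)
is_png = (header == PNG_MAGIC)     # bytes compare against bytes; no decode needed
```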

You may have to fix up string-to-byte concatenations as well. Again, you’ll minimize effort by moving these conversions as close to the I/O source or sink of the data as possible so that the interior computations are always done with Unicode strings.

Here is an error message you may see during conversion:

TypeError: str does not support the buffer interface

You handed a string to a function that was expecting a byte-buffer object. A common example of this is passing a string to the write method of a file object you opened in binary mode. To fix this, encode the string value to latin-1 (or whatever ASCII-compatible encoding you chose) before writing it.
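A minimal sketch of the fix, with io.BytesIO standing in for a file opened in "wb" mode:

```python
import io

f = io.BytesIO()              # stands in for open(path, "wb")
s = "caf\xe9"                 # a Unicode string
# f.write(s) would raise the TypeError above under Python 3;
# encode at the boundary instead:
f.write(s.encode("latin-1"))
```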

It is worth noting that the above strategy, using encode() and decode() calls with no other checking, relies on two key properties of Python 2’s handling of byte-buffer and Unicode strings:

In string operations, you can mix byte-buffers and Unicode strings, and Python 2 will silently convert between them whenever it needs to. This allows the same code to run on both Python 2 and Python 3 without worrying that, under Python 2, a single operation might be mixing byte-buffers and Unicode strings (for example, calling format() or using the % operator with a byte-string literal as the format string and Unicode strings as the data).

The str and unicode objects both have encode and decode methods (unlike Python 3, where only str has encode and only bytes has decode); a str encodes to itself, and a unicode decodes to itself. This allows you to avoid doing explicit isinstance checks to make sure you don’t call encode or decode on the wrong type of object; since your code will be mixing byte-buffers and Unicode strings, you won’t always be able to keep track of which type is being operated on at a particular point in your code.

If the above makes you nervous, however, there is a trick that avoids having to use Unicode at all under Python 2. Consider this snippet:

# Any encoding that preserves 0x80...0xff through round-tripping from byte
# streams to Unicode and back would do; latin-1 is the best known of these.

import io
import sys

binary_encoding = 'latin-1'

if str is bytes:  # Python 2
    polystr = str
    polybytes = bytes
    polyord = ord
    polychr = str
else:  # Python 3
    def polystr(o):
        if isinstance(o, str):
            return o
        if isinstance(o, bytes):
            return str(o, encoding=binary_encoding)
        raise ValueError
    def polybytes(o):
        if isinstance(o, bytes):
            return o
        if isinstance(o, str):
            return bytes(o, encoding=binary_encoding)
        raise ValueError
    def polyord(c):
        "Polymorphic ord() function"
        if isinstance(c, str):
            return ord(c)
        else:
            return c
    def polychr(c):
        "Polymorphic chr() function"
        if isinstance(c, int):
            return chr(c)
        else:
            return c
    def make_std_wrapper(stream):
        "Standard input/output wrapper factory function"
        # This ensures that the encoding of standard output and standard
        # error on Python 3 matches the binary encoding we use to turn
        # bytes to Unicode in polystr above.
        # newline="\n" ensures that Python 3 won't mangle line breaks;
        # line_buffering=True ensures that interactive command sessions
        # work as expected.
        return io.TextIOWrapper(stream.buffer, encoding=binary_encoding,
                                newline="\n", line_buffering=True)
    sys.stdin = make_std_wrapper(sys.stdin)
    sys.stdout = make_std_wrapper(sys.stdout)
    sys.stderr = make_std_wrapper(sys.stderr)

Under Python 2, str and bytes both refer to the same type object, so all that happens is that polystr and polybytes are aliased to that type object. But under Python 3, the polystr and polybytes functions are the equivalent of the encode and decode calls described above. So if you use polystr whenever you want to decode incoming data, and polybytes whenever you want to encode outgoing data, then under Python 2 your code will be using byte strings everywhere; it will only do Unicode conversions under Python 3. The only thing you need to decide is what to do if these functions receive an argument that isn’t a string at all. The above functions raise an exception, which is probably what you want if you want to make sure the functions only get used for the specific purpose of string data conversion. But there might be use cases where it makes sense to do something else.

(There are also polyord() and polychr() functions in this wrapper; polyord() prevents the lossage that otherwise happens when calling ord() on an element of a byte buffer in Python 3, while polychr() prevents problems going in the opposite direction.)

Another item in this code snippet is worth noting: under Python 3, when constructing the alternate I/O streams (note that the above snippet doesn’t do anything to them under Python 2), you have to set the newline parameter to "\n", as shown, or you will have problems with the way Python handles line breaks in your data. There are actually two issues here. The first is newline translation: by default, Python 3 opens text files in “universal newlines” mode, in which it automatically translates all non-Unix newline markers it finds (i.e., DOS-style \r\n newlines and Mac-style \r newlines) into its chosen newline marker for internal operations, which is the Unix newline, \n. Once the translation is done, there’s no way to recover the original newlines. Obviously you don’t want this default behavior.

You can stop Python from translating line breaks when reading files by passing any value for the newline parameter except None. However, when writing files, Python will translate newlines if you pass anything but a blank string '' or "\n" as the newline parameter. If you pass None, or accept Python’s default behavior, any \n characters will get translated to the system default line separator, os.linesep, on writing. If you pass the DOS or Mac newline, any \n characters will get translated to that newline. This behavior is rather counterintuitive; you might think that, if your data has all DOS newlines, you would want to tell Python that by passing newline="\r\n" when writing a file. In fact, what that will do is make Python translate every \n to \r\r\n when writing the file! This is because, when writing, Python just looks at the \n, interprets it as a newline (since that’s its internal newline character, as above), and translates it to \r\n.

Further, there is the second issue, which is string operations. If you pass Python anything but "\n" as the newline parameter to a file you open for reading, then line-related operations on that file, such as readlines() or for line in file, will break lines at markers other than \n. However, once the incoming data from that file is stored as a Unicode string, any line-related operations on that string, such as splitlines(), will only break lines at \n. So using anything other than "\n" as the newline parameter creates a mismatch between the way Python processes text files and the way it processes text strings.

You could conceivably try to work through all this, but it’s much better to just avoid the problem by using "\n" as the newline parameter for all files, and accepting that all your program’s internal data will be using \n as the newline marker, in accordance with Python’s internal data model.