Non-crashing Python 3.x output in Windows

Problem

The following little Python 3.x program just crashes with CPython 3.3.4 in a default-configured English Windows:

crash.py3

    #encoding=utf-8
    print( "Blåbærsyltetøy!")

    H:\personal\web\blog alf on programming at wordpress\001\test>chcp 437
    Active code page: 437

    H:\personal\web\blog alf on programming at wordpress\001\test>type crash.py3 | display_utf8
    #encoding=utf-8
    print( "Blåbærsyltetøy!")

    H:\personal\web\blog alf on programming at wordpress\001\test>crash.py3
    Traceback (most recent call last):
      File "H:\personal\web\blog alf on programming at wordpress\001\test\crash.py3", line 2, in <module>
        print( "Blåbærsyltet\xf8y!")
      File "C:\Program Files\CPython 3_3_4\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 12: character maps to <undefined>

    H:\personal\web\blog alf on programming at wordpress\001\test>_

Here codepage 437 is the original IBM PC character set, which is the default narrow text interpretation in an English Windows console window.
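The failure does not actually need a console to reproduce: it is just the cp437 codec refusing a character it has no slot for. Here is a minimal sketch using only the standard codecs (the variable names are mine, not from the original program):

```python
text = "Blåbærsyltetøy!"

# cp437 has å and æ but no ø, so strict encoding raises at that character.
try:
    text.encode("cp437")
    failure = None
except UnicodeEncodeError as e:
    failure = e

print(failure)

# A lossy error handler avoids the exception, at the cost of a '?' in the output.
lossy = text.encode("cp437", errors="replace")
print(lossy)
```

The reported position should match the one in the traceback above, since it is the index of the ø in the string.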

A partial solution is to change the default console codepage to Windows ANSI, which then at least for CPython matches the encoding for output to a pipe or file, and it’s nice with consistency. But Windows ANSI also has a severely limited character set, so printing any unsupported character can still crash.
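To see how far each codepage gets, one can ask the codec directly which characters it rejects. The following is only a sketch: the helper name unencodable is made up for illustration, and cp1252 is assumed as the Windows ANSI codepage (that holds for Western European Windows installations, others differ).

```python
def unencodable(text, encoding):
    """Return the characters of text that the given codec cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

sample = "Blåbærsyltetøy: 10 € ≈ π"

# cp437 = default English console codepage; cp1252 = assumed Windows ANSI.
for codepage in ("cp437", "cp1252"):
    print(codepage, unencodable(sample, codepage))
```

Each codepage handles some characters the other does not, and neither handles all of the sample, which is the point: switching codepage only moves the crash to different characters.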

Direct console output

Unicode text limited to the Basic Multilingual Plane (essentially the original 16-bit Unicode) can be output to a Windows console via the WriteConsoleW Windows API function.
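One reason for the BMP limitation is worth spelling out: WriteConsoleW takes its length in UTF-16 code units, while len() on a Python 3.3+ string counts code points. The two agree only inside the BMP. A small sketch of the difference (utf16_units is a hypothetical helper, not part of any API):

```python
def utf16_units(s):
    """Number of 16-bit code units that s occupies when encoded as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

bmp_text = "Blåbærsyltetøy!"
astral_text = "clef: \U0001D11E"    # U+1D11E lies outside the BMP.

for s in (bmp_text, astral_text):
    print(len(s), utf16_units(s))
```

So code that passes len( s ) as the character count would, as far as I can tell, hand the API too small a count for text containing characters outside the BMP.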

The standard Python ctypes module provides access to the API:

Direct_console_io.py

    import ctypes

    class Object: pass

    winapi = Object()
    winapi.STD_INPUT_HANDLE     = -10
    winapi.STD_OUTPUT_HANDLE    = -11
    winapi.STD_ERROR_HANDLE     = -12
    winapi.GetStdHandle         = ctypes.windll.kernel32.GetStdHandle
    winapi.CloseHandle          = ctypes.windll.kernel32.CloseHandle
    winapi.WriteConsoleW        = ctypes.windll.kernel32.WriteConsoleW

    class Direct_console_io:
        def write( self, s ) -> int:
            n_written = ctypes.c_ulong()
            ret = winapi.WriteConsoleW(
                self.std_output_handle, s, len( s ), ctypes.byref( n_written ), 0
                )
            return n_written.value

        def __del__( self ):
            if not winapi: return       # Looks like a bug in CPython 3.x
            winapi.CloseHandle( self.std_error_handle )
            winapi.CloseHandle( self.std_output_handle )
            winapi.CloseHandle( self.std_input_handle )

        def __init__( self ):
            self.dependency = winapi
            self.std_input_handle   = winapi.GetStdHandle( winapi.STD_INPUT_HANDLE )
            self.std_output_handle  = winapi.GetStdHandle( winapi.STD_OUTPUT_HANDLE )
            self.std_error_handle   = winapi.GetStdHandle( winapi.STD_ERROR_HANDLE )

Implementing input is left as an exercise for the reader.
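As a starting point for that exercise, here is a rough sketch of what reading via the ReadConsoleW API function might look like. It is untested against a real console here, the function name read_console_text and the 1024-unit buffer size are my own assumptions, and it loads its own kernel32 instance rather than reusing the winapi object above:

```python
import ctypes
import sys

STD_INPUT_HANDLE = -10      # Same value as in Direct_console_io.py.

def read_console_text(max_chars=1024) -> str:
    """Read up to max_chars UTF-16 units from the console (Windows only)."""
    if sys.platform != "win32":
        raise OSError("ReadConsoleW is only available on Windows")

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.restype = ctypes.c_void_p     # Avoid truncating the handle.
    handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)

    buffer = ctypes.create_unicode_buffer(max_chars)
    n_read = ctypes.c_ulong()
    ok = kernel32.ReadConsoleW(
        ctypes.c_void_p(handle), buffer, max_chars, ctypes.byref(n_read), None
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return buffer[:n_read.value]    # Usually ends with "\r\n" in line input mode.
```

The call fails when standard input is redirected from a file or pipe, so a real implementation would check isatty() first, the same way the stream setup does for output.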

Overriding the standard streams to use direct i/o and UTF-8

In addition to the silly crashing behavior, the standard streams in CPython 3.x, like sys.stdout, default to Windows ANSI for output to a file or pipe. In Python 2.7 this could be reset to the more useful UTF-8 by reloading the sys module in order to get back a dynamically removed method that could set the default encoding. That trick is no longer available in Python 3.x, so this code just creates new stream objects:

Utf8_standard_streams.py

    import io
    import sys
    from Direct_console_io import Direct_console_io

    class Dcio_raw_iobase( io.RawIOBase ):
        def writable( self ) -> bool:
            return True

        def write( self, seq_of_bytes ) -> int:
            b = bytes( seq_of_bytes )
            return self.dcio.write( b.decode( 'utf-8' ) )

        def __init__( self ):
            self.dcio = Direct_console_io()

    class Dcio_buffered_writer( io.BufferedWriter ):
        def write( self, seq_of_bytes ) -> int:
            return self.raw_stream.write( seq_of_bytes )

        def flush( self ):
            pass

        def __init__( self, raw_iobase ):
            super().__init__( raw_iobase )
            self.raw_stream = raw_iobase

    # Module initialization:
    def __init__():
        using_console_input = sys.stdin.isatty()
        using_console_output = sys.stdout.isatty()
        using_console_error = sys.stderr.isatty()

        if using_console_output:
            raw_io = Dcio_raw_iobase()
            buf_io = Dcio_buffered_writer( raw_io )
            sys.stdout = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
            sys.stdout.isatty = lambda: True
        else:
            sys.stdout = io.TextIOWrapper( sys.stdout.detach(), encoding = 'utf-8-sig' )

        if using_console_error:
            raw_io = Dcio_raw_iobase()
            buf_io = Dcio_buffered_writer( raw_io )
            sys.stderr = io.TextIOWrapper( buf_io, encoding = 'utf-8' )
            sys.stderr.isatty = lambda: True
        else:
            sys.stderr = io.TextIOWrapper( sys.stderr.detach(), encoding = 'utf-8-sig' )
        return

    __init__()
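The effect of the two encodings used above can be checked without touching the real streams. In this sketch an io.BytesIO stands in for the underlying binary stream, and encode_via_wrapper is a hypothetical helper introduced only for the demonstration:

```python
import io

def encode_via_wrapper(text, encoding):
    """Write text through an io.TextIOWrapper and return the bytes produced."""
    raw = io.BytesIO()
    # newline="" turns off newline translation so the bytes are easy to compare.
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="")
    wrapper.write(text)
    wrapper.flush()
    return raw.getvalue()

text = "Blåbærsyltetøy!\n"
plain = encode_via_wrapper(text, "utf-8")
with_bom = encode_via_wrapper(text, "utf-8-sig")

print(plain)
print(with_bom)
```

The 'utf-8-sig' variant differs only by the three-byte BOM at the start, which many Windows tools use to recognize a file as UTF-8. That is presumably why it is the choice for redirected output, while the console branch uses plain 'utf-8'.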

Disclaimer: It’s been a long time since I fiddled with Python, so possibly I’m breaking a number of conventions plus doing things in some less than optimal way. But this was the first path I found through the jungle of apparently arbitrary io class derivations etc. It worked well enough for my purposes (in a little script to convert NRK’s HTML-format subtitles to SubRip format), so, I gather it can be useful also for you – at least as a basis for more robust and/or more general code.