For safety and usability, xterm(1) now uses UTF-8 mode by default.

Ingo Schwarze (schwarze@) writes in to explain this change and how it improves security.

One particular example of such communication is an application program passing output text to a terminal emulator program. If the terminal uses a different encoding for decoding the text than the application used for encoding it, the terminal may see control codes where the application only intended printable characters. This can screw up the terminal state, spoiling display of subsequent text or even hanging the terminal.

Actually, i assume that this problem occurs frequently in practice, for the following reasons. If the application program is well-behaved, it either produces C/POSIX/US-ASCII output only, or its idea of the encoding to use is governed by the LC_CTYPE locale(1) environment variable, typically passed to it by the shell it was started from. Now that locale(1) environment is completely unrelated to whatever encoding the terminal may be set up for. It may not even be on the same physical machine. For example, during an SSH session, your terminal is on the local SSH client machine, while the shell starting your application programs is on the remote SSH server machine.

To fully appreciate the implications, try out the following scenario: Start an xterm(1) that is not UTF-8 enabled on your local machine by saying xterm +lc +u8. Unset LC_ALL, LC_CTYPE, and LANG; check with locale(1) that your locale is "C". Use ssh(1) to connect to a remote machine. Now simulate a program producing UTF-8 output on the remote machine, for example U+00DF LATIN SMALL LETTER SHARP S:

$ printf "\303\237\n" # thanks to sobrado@ for the striking example

Now your local terminal hangs until you force a reset using the menus of the xterm program, because the '\237' byte appearing in the UTF-8 encoding of that LATIN SMALL LETTER SHARP S also is the ISO 6429 C1 control code "application program command". It doesn't do anything useful in xterm(1), but causes subsequent bytes to be ignored until you send the "string terminator" byte '\234', which you probably won't ever do.

There are literally hundreds of different control sequences that terminals may or may not respect, some more or less universal, some highly specific to certain types of terminals. They change fonts, colors, encodings, and window titles, move windows around and resize them, some even change keyboard bindings, and many, many more things - some of which may actually be dangerous depending on what exactly you are using your terminal for.

If the shell startup files on the remote machine set LC_CTYPE=en_US.UTF-8 or something similar by default, programs on the remote machine will always do just that: send UTF-8 encoded output over the wire that can utterly confuse your local terminal.
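To make the byte-level overlap concrete, here is a minimal shell sketch that lists the bytes of a stream falling into the range 0x80-0x9f, which a terminal in 8-bit mode decodes as ISO 6429 C1 controls. The function name c1_bytes is made up for this illustration; it relies only on od(1), tr(1), grep(1), and sort(1).

```shell
# c1_bytes: hypothetical helper for illustration, not part of any system.
# Reads stdin and prints, one per line, each distinct byte value (in hex)
# that lies in the C1 range 0x80-0x9f.
c1_bytes() {
    od -An -v -tx1 |               # dump every input byte as two hex digits
    tr -s ' ' '\n' |               # put each byte on a line of its own
    grep -E '^[89][0-9a-f]$' |     # keep only 80..9f
    sort -u                        # report each value once
}

# U+00DF in UTF-8 is the two bytes c3 9f; the second one is a C1 byte.
printf '\303\237' | c1_bytes       # prints: 9f
```

Pure US-ASCII input produces no output at all, which matches the observation that such text cannot trigger C1 controls by accident.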

That shows how easy it is to inadvertently cause application-terminal character encoding mismatches; yet i doubt that many people are aware of the problem. So we should try to reduce the likelihood that people get burnt by such effects.
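One way to see which encoding well-behaved programs started from a given shell will assume is to resolve the locale variables by hand. The sketch below follows the usual precedence (LC_ALL, then LC_CTYPE, then LANG, with empty values counting as unset) and assumes "C" as the fallback. The function name effective_ctype is hypothetical; locale(1) remains the authoritative tool.

```shell
# effective_ctype: hypothetical helper for illustration only.
# Prints the locale name that governs character encoding for programs
# started from this shell: LC_ALL wins over LC_CTYPE, which wins over LANG.
# Assumption: if none of them is set, the locale is "C".
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

effective_ctype
```

Running this on both ends of an SSH session shows what the remote programs will assume; it says nothing about how the local terminal is actually configured, which is exactly the gap described above.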

On an operating system supporting any third locale in addition to C/POSIX and UTF-8, people are screwed beyond rescue because even if one side of the connection assumes US-ASCII, communication is still unsafe in both directions. Reinterpreting US-ASCII in an arbitrary encoding and reinterpreting an arbitrary encoding as US-ASCII may both turn innocuous printable characters into dangerous terminal control codes. That is particularly bitter because some programs will always output US-ASCII, which is not safe to display in a terminal set up for an arbitrary locale.

Fortunately, in OpenBSD, we made the decision to only support exactly two locales, C/POSIX and UTF-8, and this combination has the following properties:

Printing unsanitized strings to the terminal is never safe, no matter the locale and terminal setup (think of cat /bsd).

Printing sanitized US-ASCII to a US-ASCII terminal is safe.

Printing sanitized UTF-8 to a UTF-8 terminal is safe.

Printing sanitized US-ASCII to a UTF-8 terminal is safe. That is important because there are some programs that we may never want to add UTF-8 support to. However:

Printing sanitized UTF-8 to a US-ASCII terminal is *NOT* safe. Remember the example above that hung a US-ASCII terminal by printing U+00DF LATIN SMALL LETTER SHARP S in UTF-8 to it.

Until this week, our xterm(1) ran in US-ASCII mode by default. In view of the above, that was a terrible idea, even if the user didn't intend to ever use UTF-8. A UTF-8 terminal handles the US-ASCII the user wants just fine, and in addition to that, and mostly for free, it is more resilient against stray UTF-8 sneaking in.

Actually, even when fed garbage or unsupported encodings, a UTF-8 xterm(1) is more robust than a US-ASCII xterm(1) because the UTF-8 xterm(1) honours *fewer* terminal escape codes than the US-ASCII xterm(1). That may seem surprising at first because Unicode defines *more* control characters than US-ASCII does. But as explained on http://invisible-island.net/xterm/ctlseqs/ctlseqs.html xterm(1) never treats decoded multibyte characters as terminal control codes, so the ISO 6429 C1 control codes do not take effect in UTF-8 mode; but they do take effect in US-ASCII mode, even though they fall outside the scope of ASCII.

Consequently, in the interest of safe and sane defaults, i recently switched our xterm(1) to enable UTF-8 mode by default. I did that by adding this resource to /usr/X11R6/share/X11/app-defaults/XTerm:

*locale: UTF-8

The main goal is improving robustness. But it also improves usability. If you usually run your shells inside xterm(1) in C/POSIX mode, there should be few visible changes for you. But if you ever stumble upon a directory containing UTF-8 filenames, you can simply say

$ LC_CTYPE=en_US.UTF-8 ls

which would have given you garbage output in the past, and which just works now in OpenBSD-current.

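The safety properties listed earlier all assume that output is sanitized before it reaches the terminal. As a rough sketch of what sanitizing to printable US-ASCII could look like in shell, the hypothetical filter below replaces every byte other than tab, newline, and the printable ASCII range with a question mark. It is an illustration under those assumptions, not a description of how any OpenBSD utility sanitizes its output.

```shell
# ascii_sanitize: hypothetical filter for illustration only.
# Replaces every byte except tab (011), newline (012) and printable
# US-ASCII (040-176) with '?'. LC_ALL=C makes tr(1) operate on single bytes.
ascii_sanitize() {
    LC_ALL=C tr -c '\011\012\040-\176' '?'
}

# The two UTF-8 bytes of U+00DF and the ESC byte each become '?'.
printf 'a\303\237b \033[31mred\n' | ascii_sanitize   # prints: a??b ?[31mred
```

The result is safe to print to either a US-ASCII or a UTF-8 terminal, at the price of losing every non-ASCII character.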
If you really insist on running xterm(1) in traditional 8-bit character mode by default like in the past - which, nota bene, isn't quite C/POSIX/US-ASCII but does many additional things you are probably unaware of - you can do so in any of the following ways. But i do not recommend that at all; there are hardly any sane use cases, except maybe using weird, probably unsafe software that insists on sending ISO 6429 C1 controls in 8-bit mode rather than encoding them as two-byte sequences starting with the ASCII ESCAPE character, as most software that controls terminals via control codes does. If you insist against all advice, you can:

Add XTerm*locale: true to your ~/.Xresources file, or use the -lc command line option for the same effect. That will also use UTF-8 mode, but use luit(1) to transform US-ASCII to UTF-8 on input, which is probably mostly a no-op, but might expose some subtle differences. Not recommended.

Add XTerm*locale: false to your ~/.Xresources file, or use the +lc command line option for the same effect. That will inspect LC_CTYPE in the environment and use UTF-8 mode if that specifies a UTF-8 locale, and traditional 8-bit character mode otherwise. Don't forget to run xrdb ~/.Xresources after editing the file.

Add XTerm*locale: medium to your ~/.Xresources file, to get exactly the old defaults back. They do weird things; for details, read the source code in charproc.c, function VTInitialize_locale(), lines 7385-7404. Not recommended.