My confusion from yesterday was due to a bug, which was promptly fixed (test case, fix).

Now that I understand what is intended, the situation is a lot clearer. In Python 3.0, there are two types of strings, Bytes and Unicode, and the determination of the type is static. With Ruby 1.9, there is one type of string, and the associated encoding is mutable. The internal state of a given sequence of bytes with respect to the current encoding is one of UNKNOWN, 7BIT, VALID, or BROKEN. UNKNOWN is a mechanism to delay the binding. The combination of the bug and the delayed binding made the situation confusing, as the correctness of the result depended on the order of the operations performed.
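A minimal sketch of what "mutable encoding" means in practice, assuming a Ruby 1.9 or later interpreter. The helper name lengths_by_encoding is hypothetical and exists only for illustration; force_encoding relabels the same bytes without changing them.

```ruby
# Hypothetical helper: measure one byte sequence under two encoding labels.
# force_encoding changes only the label on the string, never its bytes.
def lengths_by_encoding(str)
  s = str.dup                                         # leave the caller's string alone
  binary = s.force_encoding(Encoding::BINARY).length  # counted as raw bytes
  utf8   = s.force_encoding(Encoding::UTF_8).length   # counted as UTF-8 characters
  { :binary => binary, :utf8 => utf8, :bytes => s.bytes.to_a }
end

result = lengths_by_encoding("caf\xC3\xA9")
puts "as binary: #{result[:binary]} units, as UTF-8: #{result[:utf8]} units"
```

The same object answers differently depending on its current label, which is the contrast with Python 3.0's static split between two types.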

The bug affected gsub!, but not sub, sub!, or gsub. In the released 1.9.0 version of Ruby, gsub! did not update the state of the resulting string. Oops. Now that that is corrected, everything works as expected, for some values of expected. Things I was not previously aware of:

Array#pack does not set the encoding. For some cases, it is arguable that the encoding could be inferred, including the common idiom of [Fixnum].pack('U'), but Ruby 1.9 makes no attempt to do so. Fixnum.chr(Encoding) is the preferred alternative.
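A small sketch of the chr alternative, under the assumption that the goal is a one-character string with an explicit encoding. The helper char_for is a hypothetical name. The encoding that pack('U') reports is printed rather than asserted, since it may differ between Ruby versions.

```ruby
# Hypothetical helper: build a one-character string from a code point,
# stating the encoding explicitly instead of relying on inference.
def char_for(code_point, encoding = Encoding::UTF_8)
  code_point.chr(encoding)
end

euro   = char_for(0x20ac)
packed = [0x20ac].pack('U')

puts "chr:  #{euro.encoding}"
puts "pack: #{packed.encoding}"   # depends on the Ruby version in use
puts "same bytes: #{euro.bytes.to_a == packed.bytes.to_a}"
```

Either way the bytes are identical; only the label attached to them is in question.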

String#ascii_only? and String#valid_encoding? may be used to probe the internal state of a given string. Once probed, the state is no longer UNKNOWN.
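As a rough illustration, the two probes can be combined to recover the other three states. The method name string_state and the symbols it returns are my own labels, not part of Ruby; UNKNOWN never shows up because asking the question resolves it.

```ruby
# Hypothetical helper: map the two public probes onto the three
# observable internal states (UNKNOWN is resolved by the act of probing).
def string_state(str)
  return :seven_bit if str.ascii_only?
  str.valid_encoding? ? :valid : :broken
end

truncated_euro = "\xE2\x82".dup.force_encoding(Encoding::UTF_8)

puts string_state("abc")           # seven_bit
puts string_state("\u20ac")        # valid
puts string_state(truncated_euro)  # broken
```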

The meaning of "\xXX" depends on the encoding declared in the source file. This may turn out to be handy.

Locale environment variables only affect the interpretation of data files, not source files. This policy seems defensible.

In addition to "\uXXXX", Unicode strings may be expressed as "\u{X}", where X may be a space-separated sequence of hex strings of varying length. "\u{10346}" is the Gothic Faihu character, and "\u{a3 a5 20ac}" produces the pound, yen, and euro characters.
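A short sketch of the braced form next to the four-digit form. The constant names are illustrative only, and U+10000 is used simply as an example of a code point that needs more than four hex digits.

```ruby
# The braced form accepts several code points separated by spaces.
CURRENCY = "\u{a3 a5 20ac}"

# The same three characters written with the fixed four-digit form.
CURRENCY_LONGHAND = "\u00a3\u00a5\u20ac"

# A code point above U+FFFF, which the four-digit form cannot express.
SUPPLEMENTARY = "\u{10000}"

puts CURRENCY == CURRENCY_LONGHAND
puts "#{CURRENCY.length} characters, #{CURRENCY.bytesize} bytes"
puts "#{SUPPLEMENTARY.length} character, #{SUPPLEMENTARY.bytesize} bytes"
```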

The net result of all this is that any sequence of operations that produces a runtime exception in Ruby 1.9 would also produce a runtime exception in Python 3.0. Some use cases that are entirely safe will not produce an exception in Ruby 1.9 when they would in Python 3.0. Such an approach is entirely consistent with a dynamic language.
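To make the comparison concrete, here is a sketch of one safe and one unsafe concatenation. The helper try_concat and the :incompatible symbol are hypothetical; the Python 3.0 comparison in the comments is the claim from the paragraph above, not something this code checks.

```ruby
# Hypothetical helper: concatenate two strings, reporting an encoding
# clash as a symbol instead of letting the exception propagate.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError
  :incompatible
end

euro  = "\u20ac"                                          # UTF-8, non-ASCII
pound = "\xA3".dup.force_encoding(Encoding::ISO_8859_1)   # Latin-1, non-ASCII
ascii = "abc".dup.force_encoding(Encoding::BINARY)        # raw bytes, all 7-bit

# Safe in Ruby: the binary operand is 7-bit only, so no exception is raised.
# Mixing bytes and text like this would fail in Python 3.0.
puts try_concat(euro, ascii).encoding

# Unsafe in both: two non-ASCII strings with different encodings.
puts try_concat(euro, pound)
```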