I’ve got portions of HTML5lib working on Ruby 1.9, enough to pass Mars’s unit tests. My initial reaction to Ruby 1.9’s Unicode support isn’t favorable. I definitely like Python 3000’s Unicode support better; this feels closer to Python 2.5. In fact, I think I prefer Ruby 1.8’s non-support for Unicode over Ruby 1.9’s “support”.

The problem is one that is all too familiar to Python programmers: you can have a fully unit-tested library, have somebody pass you a bad string, and you will fall over. An example that fails with Ruby 1.9:

[0x2639].pack('U') + "\u2639"

The error produced is ArgumentError: character encodings differ. The left-hand side specifies packing as UTF-8. The right-hand side is expressed as Unicode, which Ruby represents as #<Encoding:UTF-8>. The problem is that the left-hand side is actually stored as #<Encoding:ASCII-8BIT>, which is a misnomer. In many ways this mirrors Python 2.x’s <type 'str'> vs <type 'unicode'>, except that with Ruby 1.9 both Strings are the same type.
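The clash can be seen by inspecting the encodings directly. A minimal sketch follows; note that on released Rubies, pack('U') tags its result as UTF-8 and the error class is Encoding::CompatibilityError, so this reproduces the mismatch by explicitly force-encoding the same bytes to ASCII-8BIT:

```ruby
frown_utf8  = "\u2639"                                     # stored as UTF-8
frown_bytes = "\xE2\x98\xB9".force_encoding('ASCII-8BIT')  # same bytes, tagged "binary"

puts frown_utf8.encoding    # UTF-8
puts frown_bytes.encoding   # ASCII-8BIT

# Concatenating a binary string containing high bytes with a UTF-8
# string raises, even though the bytes are identical.
begin
  frown_bytes + frown_utf8
rescue Encoding::CompatibilityError => e
  puts e.message   # incompatible character encodings: ASCII-8BIT and UTF-8
end
```

The strings are byte-for-byte equal; only the encoding tag differs, and that alone is enough to make concatenation blow up.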

Ruby 1.9 both mitigates and compounds the problem by providing a number of implicit conversions. Sometimes. Take a look at this code, which produces this output. Specifically, look at rows 2 and 4, where two Strings of the same type, encoding, length, and value produce different results when concatenated with UTF-8 strings. This type of magic destroys any confidence I have in unit testing as a viable strategy.
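The “sometimes” is what makes this treacherous. A sketch of the rule as it works on released Rubies: an ASCII-8BIT string containing only 7-bit bytes is silently reconciled with a UTF-8 string, while the same concatenation with a single high byte raises:

```ruby
utf8 = "snowman \u2603"

# Compatible: every byte in the binary string is 7-bit, so Ruby
# quietly coerces and the result comes out as UTF-8.
ascii_bytes = "abc".force_encoding('ASCII-8BIT')
puts (ascii_bytes + utf8).encoding   # UTF-8

# Incompatible: one high byte, same encoding tag, and the identical
# operation raises instead.
high_bytes = "\xE2".force_encoding('ASCII-8BIT')
begin
  high_bytes + utf8
rescue Encoding::CompatibilityError => e
  puts e.class
end
```

So whether a concatenation succeeds depends not just on the declared encodings but on the byte values inside the strings, which is exactly the kind of data-dependent behavior a unit test suite can easily miss.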

Update: no magic, just a bug.

My preference would be that #<Encoding:ASCII-8BIT> be abolished, in favor of #<Encoding:ASCII-7BIT> and a separate Bytes class. Generally, programmers would see objects of class Bytes only if they do “binary” file I/O, explicitly create constants of that type, or invoke methods such as String#bytes.
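Those are in fact the places where byte-oriented values surface today; a sketch of the two cases named above, using a temporary file so it is self-contained:

```ruby
require 'tempfile'

# Binary file I/O: reading in binmode tags the result ASCII-8BIT,
# which is the role a distinct Bytes class would play.
Tempfile.create('demo') do |f|
  f.binmode
  f.write("\u2639")
  f.rewind
  data = f.read
  puts data.encoding    # ASCII-8BIT
end

# String#bytes: explicitly dropping down to the byte level.
p "\u2639".bytes        # [226, 152, 185]
```

In such a design, everything else would stay in character-string territory, and the binary/text distinction would live in the type system rather than in an encoding tag.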

Other suggestions: