Being a brief code fragment that makes me happy.

There’s this little 10-byte file called 4c, like so:

~/dev/rx/ 627> hexdump 4c
0000000 26 d0 96 e4 b8 ad f0 90 8d 86

These bytes are the UTF-8 encoding of a particular four-character string as described in Characters vs. Bytes.
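To make the bytes-versus-characters point concrete, here is a minimal sketch that rebuilds the same ten bytes in memory rather than reading the file. The variable names `bytes`, `s`, and `widths` are mine, for illustration only, and it assumes a Ruby recent enough to have String#force_encoding (1.9 or later).

```ruby
# The ten bytes shown in the hexdump, rebuilt in memory (no file needed).
bytes = [0x26, 0xd0, 0x96, 0xe4, 0xb8, 0xad, 0xf0, 0x90, 0x8d, 0x86]

# pack('C*') produces a binary string; force_encoding only relabels it as UTF-8.
s = bytes.pack('C*').force_encoding('UTF-8')

# How many bytes each character occupies in its UTF-8 form.
widths = s.each_char.map { |ch| ch.bytesize }

puts s.bytesize       # total bytes
puts s.length         # total characters
puts widths.inspect   # bytes per character
```

The four characters take one, two, three, and four bytes respectively, which is how ten bytes end up being only four characters.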

I’m running Ruby 1.9 as checked out from svn earlier today:

~/dev/rx/ 628> ruby -v
ruby 1.9.0 (2008-09-19 revision 19423) [i386-darwin9.4.0]

There’s a new method, String#each_codepoint:

~/dev/rx/ 629> ri String#each_codepoint
-------------------------------------------------- String#each_codepoint
     str.each_codepoint {|integer| block }    => str
------------------------------------------------------------------------
     Passes the +Integer+ ordinal of each character in _str_, also known
     as a _codepoint_ when applied to Unicode strings to the given
     block.

        "hello\u0639".each_codepoint {|c| print c, ' ' }

     _produces:_

        104 101 108 108 111 1593

And it works! (Disclaimer: I probably am not using the best and simplest idiom.)

~/dev/rx/ 630> irb
irb(main):001:0> u = File.read('4c').force_encoding('UTF-8')
=> "&Ж中𐍆"
irb(main):002:0> u.each_codepoint {|c| printf("U+%04X\n", c) }
U+0026
U+0416
U+4E2D
U+10346

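As for a simpler idiom, one possibility is to hand the encoding to File.read directly instead of relabeling the string afterwards. This is only a sketch: it assumes a later Ruby than the svn checkout above (File.binwrite and the `encoding:` option may not exist in that build), and `codepoint_labels` is a hypothetical helper name of my own. It writes its own copy of the ten bytes to a temporary directory so that it runs anywhere.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of Ruby): read a file as UTF-8 and
# return one "U+XXXX" label per codepoint.
def codepoint_labels(path)
  File.read(path, encoding: 'UTF-8').each_codepoint.map { |c| format('U+%04X', c) }
end

Dir.mktmpdir do |dir|
  # Recreate the 10-byte file so the example is self-contained.
  path = File.join(dir, '4c')
  File.binwrite(path, [0x26, 0xd0, 0x96, 0xe4, 0xb8, 0xad, 0xf0, 0x90, 0x8d, 0x86].pack('C*'))

  puts codepoint_labels(path)
end
```

Returning the labels as an array instead of printing inside the block keeps the formatting separate from the output, which makes the result easy to check.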
Further background and explanation may be found here. I felt like writing back saying “And can we have ponies, too?”