I've been hacking on some Perl code that extracts data that came from web users around the world and was stored in MySQL (with no real encoding information, of course). My goal is to generate well-formed, valid XML that can be read by another tool.

Now I'll be the first to admit that I never really took the time to like, understand, or pay much attention to all the changes in Perl's character and byte handling over the years. I'm one of those developers that, I suspect, is representative of the majority (at least in this self-centered country). I think it's all stupid and complicated and should Just Work... somehow.

But at the same time I know it's not.

Anyway, after importing lots of data I came across my first bug. Well, okay... not my first bug. My first bug related to this encoding stuff. The XML parser on the receiving end raised hell about some weird characters coming in.

Oh, crap. That's right. This is the big bad Internet and I forgot to do anything to scrub the data so that it'd look like the sort of thing you can cram into XML and expect to maybe work.

A little searching around managed to jog my memory and I updated my code to include something like this:

use Encode;
...
my $data = Encode::decode('utf8', $row->{'Stuff'});

And all was well for quite some time. I got a lot farther with that until this weekend when Perl itself began to blow up on me, throwing fatal exceptions like this:

Malformed UTF-8 character (fatal) ...

My first reaction, like yours probably, was WTF?!?! How on god's green earth is this a FATAL error?

After much swearing, a Twitter plea, and some reading (thanks Twitter world!), I came across a section of the Encode manual page from Perl.

I'm going to quote from it a fair amount here because I know you're as lazy as I am and won't go read it if I just link here. The relevant section is at the very end (just before SEE ALSO) and titled UTF-8 vs. utf8.

....We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

That has been the perl's notion of UTF-8 but official UTF-8 is more strict; Its ranges is much narrower (0 .. 10FFFF), some sequences are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et al).

Now that is overruled by Larry Wall himself.

From: Larry Wall
Date: December 04, 2004 11:51:58 JST
To: perl-unicode@perl.org
Subject: Re: Make Encode.pm support the real UTF-8
Message-Id: <20041204025158.GA28754@wall.org>

On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
: I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
: but "UTF-8" is the name of the standard and should give the
: corresponding behaviour.

For what it's worth, that's how I've always kept them straight in my head.

Also for what it's worth, Perl 6 will mostly default to strict but make it easy to switch back to lax.

Larry

Do you copy? As of Perl 5.8.7, UTF-8 means strict, official UTF-8 while utf8 means liberal, lax, version thereof. And Encode version 2.10 or later thus groks the difference between "UTF-8" and "utf8".

encode("utf8", "\x{FFFF_FFFF}", 1); # okay
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

"UTF-8" in Encode is actually a canonical name for "utf-8-strict". Yes, the hyphen between "UTF" and "8" is important. Without it Encode goes "liberal"

find_encoding("UTF-8")->name # is 'utf-8-strict'
find_encoding("utf-8")->name # ditto. names are case insensitive
find_encoding("utf_8")->name # ditto. "_" are treated as "-"
find_encoding("UTF8")->name # is 'utf8'.

Got all that?

The sound you heard last night was me banging my head on a desk. Repeatedly.

I mean, how could I have possibly noticed the massive difference between utf8 and UTF-8? Really. I must have been on some serious crack.

Sigh!
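For what it's worth, the difference is easy to poke at once you know it exists. Here is a rough sketch (not from the manual) that asks Encode for the canonical name behind each spelling and then tries to encode a code point just past the Unicode limit of 0x10FFFF under both. The helper name can_encode is made up for illustration, and the exact croak message will vary by Encode version, so the sketch only reports whether the call survived.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

# Hypothetical helper (not part of Encode): returns 1 if the code point
# encodes cleanly under the named encoding, 0 if Encode croaks.
sub can_encode {
    my ($encoding_name, $code_point) = @_;
    no warnings 'utf8';    # code points beyond Unicode may trigger warnings
    my $string = chr($code_point);
    # FB_CROAK asks Encode to die on anything it refuses to encode.
    my $ok = eval { encode($encoding_name, $string, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

for my $name ('utf8', 'UTF-8') {
    printf "%-6s canonical name: %-13s encodes 0x110000: %s\n",
        $name,
        find_encoding($name)->name,
        can_encode($name, 0x110000) ? 'yes' : 'no';
}
```

The lax spelling should let the out-of-range code point through, while the strict one should refuse it, which matches the "okay" and "croaks" pair in the manual.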

Needless to say my code now looks more like this:

use Encode;
...
my $data = Encode::decode('UTF-8', $row->{'Stuff'}); ## fuck!

Actually, I was kidding about the "fuck!" I wouldn't swear in code.
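Since the whole point was XML that another tool can actually parse, here is a minimal sketch of how the decode step could be paired with a scrub of characters that XML 1.0 does not allow. It is an illustration under assumptions, not the code I ran: to_xml_safe_text is a made-up helper name, and it relies on decode() substituting U+FFFD for malformed input when no CHECK argument is passed, which is how the Encode manual describes the default.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): turn raw bytes from the
# database into text that is safe to place in an XML 1.0 document.
sub to_xml_safe_text {
    my ($bytes) = @_;
    return '' unless defined $bytes;

    # No CHECK argument: malformed sequences become U+FFFD instead of
    # raising a fatal error.
    my $text = decode('UTF-8', $bytes);

    # Remove anything outside the XML 1.0 Char production
    # (tab, LF, CR, and the three allowed code point ranges).
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

    return $text;
}

# Valid UTF-8 for "e acute", then a NUL byte, a space, and a stray 0xFF byte.
my $clean = to_xml_safe_text("caf\xC3\xA9\x00 \xFF");
printf "%d characters survive\n", length $clean;
```

This only deals with encoding and illegal characters. Escaping &, < and > is still up to whatever writes the XML.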

Posted by jzawodn at September 02, 2008 02:10 PM