The Split is Not Enough: Unicode Whitespace Shenanigans for Rubyists

By Peter Cooper

That code is legal Ruby! If you ran it, you'd see 8. How? There's a tale to tell...

The String with the Golden Space

I was on IRC in #nwrug enjoying festive cheer with fellow Northern Rubyists when ysr23 presented a curious problem.

He was using a Twitter library that returned a tweet, "@twellyme film", in a string called reply. The problem was that despite calling reply.split, the string refused to split on whitespace. Yet if he did "@twellyme film".split in IRB, that was fine.

International man of mystery Will Jessop suggested checking $; (it's a special global variable that defines the default separator for String#split). It was OK.
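For the curious, that check is a one-liner. A minimal sketch, assuming nothing in your process has reassigned the variable (the sample string is made up for illustration):

```ruby
# $; is nil unless something has reassigned it; String#split then falls
# back to splitting on whitespace.
separator = $;
puts separator.inspect

# With the default in place, runs of ordinary ASCII spaces split as expected.
words = "one two  three".split
puts words.inspect
```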

In an attempt to look smarter than I am, I suggested reply.method(:split).source_location to see if the String class had been monkey-patched by something annoying. Nope. (Though this is a handy trick if you do want to detect if anyone's tampered with something.)
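Here's roughly how that trick behaves. This is a sketch, not what ysr23 actually ran: String#shout is a hypothetical monkey patch invented for the demo, and the nil for a built-in method is what you'd typically see for methods implemented inside the interpreter rather than in Ruby code.

```ruby
# Hypothetical monkey patch, purely for illustration: String#shout is not a
# real core method.
class String
  def shout
    upcase + "!"
  end
end

# Methods defined in Ruby code report the file and line where they were defined.
patched = "".method(:shout).source_location
puts patched.inspect

# Methods implemented inside the interpreter itself typically report nil,
# so a non-nil answer for String#split would be a red flag.
builtin = "".method(:split).source_location
puts builtin.inspect
```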

Someone suggested Mr. Ysr23 show us reply.codepoints.to_a:

# reply.codepoints.to_a => [64, 116, 119, 101, 108, 108, 121, 109, 101, 160, 102, 105, 108, 109]

Something leapt out at me. Where was good old 32!? Instead of trusty old ASCII 32 (space) stood 160, a number alien to my ASCII-trained 1980s-model brain.
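You can reproduce the symptom without going anywhere near Twitter. A small sketch, assuming a hand-built stand-in for the tweet with "\u00A0" (codepoint 160) between the two words:

```ruby
# Stand-in for the tweet: "\u00A0" is the non-breaking space (codepoint 160).
reply = "@twellyme\u00A0film"

p reply.codepoints     # 160 sits where 32 would normally be
p reply.split          # a single element: the default split ignores codepoint 160
p reply.include?(" ")  # false, there is no ASCII space in the string at all
```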

From Google with Love

To the Google-copter!

Aha! Non-breaking space. That's why split was being as useful as a chocolate teapot.

After an intense 23 seconds of discussion, we settled on a temporary solution for Mr. Ysr23 who, by this time, was busy cursing Twitter and all who sailed upon her:

reply.gsub(/[[:space:]]/, ' ').split

The solution is simple. Use the POSIX bracket expression [[:space:]], which in Ruby 1.9 and beyond matches Unicode's idea of what whitespace is, and convert all matches into vanilla ASCII spaces. reply.split(/[[:space:]]+/) is another, more direct option; we just didn't think of it at the time.
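Both variants side by side, as a sketch against the same hand-built stand-in for the tweet (the reply string here is my reconstruction, not the library's actual output):

```ruby
# Stand-in for the tweet, with a non-breaking space between the words.
reply = "@twellyme\u00A0film"

# Option 1: normalise every Unicode whitespace character to a plain space,
# then split as usual.
normalized = reply.gsub(/[[:space:]]/, ' ').split
p normalized

# Option 2: split directly on runs of Unicode whitespace.
direct = reply.split(/[[:space:]]+/)
p direct

# For contrast, plain \s only covers ASCII whitespace, so it finds nothing here.
p reply =~ /\s/
```

Option 2 leaves the original string untouched, which is handy if you still need the tweet exactly as it arrived.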

Quantum of Spaces

Solving an interesting but trivial issue wasn't where I wanted to end my day. I'd re-discovered an insidious piece of Unicode chicanery and was in the mood for shenanigans!

Further Googling taught me you can type non-breaking spaces directly on OS X with Option+Space. (You can do the homework for your own platform.)

I also knew Ruby 1.9 and beyond would let you use Unicode characters as identifiers if you let Ruby know about the source's encoding with a magic comment, so it was time for shenanigans to begin!

My first experiment was to try and use non-breaking spaces in variable names.
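Something along these lines shows the idea. It's a sketch: I'm building the source with a "\u00A0" escape and handing it to eval so the invisible character is actually visible on the page, and the variable name is made up.

```ruby
nbsp = "\u00A0" # non-breaking space

# The evaluated source reads `a b = 5` then `a b + 3`, except the gap in the
# middle of the name is a non-breaking space, so `a b` is a single identifier.
source = "a#{nbsp}b = 5\na#{nbsp}b + 3"
result = eval(source)
p result
```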

Cool! So what about variable names and method names?

What about without any regular printable characters in the identifiers at all?
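Again as a sketch with escapes and eval so you can see what's going on (the demo class is hypothetical and exists only for this example):

```ruby
nbsp = "\u00A0" # non-breaking space

# Two local variables: one named with a single non-breaking space, the other
# with two. The evaluated source is three lines of apparently blank names.
source = "#{nbsp} = 5\n#{nbsp * 2} = 3\n#{nbsp} + #{nbsp * 2}"
total = eval(source)
p total

# A method whose entire name is one non-breaking space, defined on a
# throwaway class.
demo = Class.new
demo.class_eval("def #{nbsp}; 8; end")
p demo.new.public_send(nbsp)
```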

And so we're back to where we started. A hideous outcome from a trivial weekend on IRC. But fun, nonetheless. Stick it in your "wow, nice, but totally useless" brain box.

A Warning

Please don't use this in production code or the Ruby gods will come and haunt you in your sleep. But... if you want to throw some non-breaking spaces into your next pair programming session, conference talk, or job interview, just to see if anyone's paying attention, I'll be laughing with you. (And if you're a C# developer, Andy Pike tells me that C# supports these shenanigans too.)

P.S. My Ruby 2.0 Walkthrough Kickstarter only has about 12 hours to go! Check it out if Ruby 2.0 is on your radar or you want a handy way to get up to speed when it drops in February 2013.