Hacking Unicode for Unbreakable Kaomoji ᕙ(⇀‸↼⁠‶)ᕗ

Kaomoji (a.k.a. dongers or text faces) — the much cooler cousins of emoticons — are assembled by combining letters, punctuation, and other symbols from a variety of languages and alphabets. You might know the famous shrug ¯⁠\_(ツ)_/⁠¯, or of course, the infinite variations of angry table flipping (ノ⁠ಠ⁠益⁠ಠ)⁠ノ⁠彡⁠┻━┻.

Unfortunately, kaomoji have a fatal flaw: line-breaking. When deciding how to display a piece of text (such as these paragraphs, in your web browser) a program has to decide where to begin new lines.

This is a sentence whose lines break at sensible locations. Looks good! ᕕ( ᐛ )ᕗ This sentence, on the other hand, breaks at arbi trary positions. Not so good. (๑•́⁠ㅿ⁠•̀๑) ᔆᵒʳʳᵞ

Because kaomoji are not conventional alphabetical words, your computer’s line-breaking algorithms don’t realize they are supposed to stay together, and so will often split the poor things apart.

We’ve extended Dango to suggest kaomoji using the same neural nets that we use for emoji, but we wanted to make sure this line-breaking issue was solved. To this end, we’ve invented a simple technique that abuses Unicode to build unbreakable kaomoji that stick together through thick and thin (column widths). You can see unbreakable kaomoji in action in the animation above, or search for your own below:





Background on Unicode

When people think of the Unicode standard (if people think of the Unicode standard) they most often think of it as a system for representing text. It does do this: Unicode assigns a number, called a ‘codepoint’, to every symbol in the majority of writing systems, including extinct ones. For example, U+0065 is ‘e’, U+30C4 is ‘ツ’, and U+1F355 is ‘🍕’. But it also does a whole lot more. The standard specifies tons of other types of behaviour around the layout and display of text, including how to display ac̮ͥcent͑͝s and diͭacr͉it̊ic̕al̂͑ mark͎͑sͮ, how to segment text into characters and words, various different ways of actually encoding the text for storage, and much more.
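As a quick sanity check of the codepoints quoted above, here is a minimal Python sketch using only the built-in ord function. The helper name codepoint_label is made up for this illustration and is not part of any library.

```python
# Codepoints quoted in the text, checked with Python's built-in ord().
EXAMPLES = {"e": 0x0065, "ツ": 0x30C4, "🍕": 0x1F355}


def codepoint_label(ch):
    """Format a single character as a U+XXXX label (at least four hex digits).

    Hypothetical helper, named only for this illustration.
    """
    return "U+{:04X}".format(ord(ch))


for ch, expected in EXAMPLES.items():
    # ord() maps a one-character string to its integer codepoint.
    assert ord(ch) == expected
    print(ch, codepoint_label(ch))
```

The U+ notation is simply the codepoint number in hexadecimal, conventionally padded to at least four digits.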

For the purposes of building unbreakable kaomoji, we’re particularly interested in UAX 14: “Unicode Line Breaking Algorithm”, which explains the (surprisingly complex) rules about where your computer can and cannot automatically introduce line breaks. There are lots of subtle edge cases, for instance:

LB30: Do not break between letters, numbers, or ordinary symbols and opening or closing parentheses. The purpose of this rule is to prevent breaks in common cases where a part of a word appears between delimiters—for example, in “person(s)”.

A close read of this standard provides some tools that we can use to solve the kaomoji breaking problem.

Ingredient 1: the No-Break Space U+00A0

One of the obvious places where a line break is typically allowed is at the “space” character (codepoint U+0020). However, many kaomoji include internal spaces, and we don’t want to break them apart there. Luckily there’s another character in Unicode, the no-break space (or NBSP; codepoint U+00A0), which looks identical to a regular space but prevents line-breaking (LB12). A typical use of the NBSP would be to ensure that for a quantity with a unit like “10 km”, the “km” doesn’t separate from the “10”. By replacing every space with a no-break space, we can ensure line breaks aren’t introduced when a space occurs mid-kaomoji.

(Pro tip: if you’re on a Mac, you can type the no-break space by holding down “option” while pressing space.)
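The space-swapping step on its own can be sketched in a few lines of Python. The helper name harden_spaces and the sample face are assumptions made for this illustration. Only the standard library is used.

```python
import unicodedata

SPACE = "\u0020"  # ordinary space: a break is normally allowed here
NBSP = "\u00A0"   # no-break space: looks the same, but glues its neighbours


def harden_spaces(s):
    """Replace every ordinary space with a no-break space.

    Hypothetical helper name, used only for this sketch.
    """
    return s.replace(SPACE, NBSP)


face = "( o_o )"  # sample face with two ordinary spaces
hardened = harden_spaces(face)

# Same length and same look, but no breakable U+0020 remains.
assert len(hardened) == len(face)
assert SPACE not in hardened
print(unicodedata.name(NBSP))
```

Writing the two characters as escapes rather than literals keeps the source readable, since they are indistinguishable on screen.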

Ingredient 2: the Word Joiner U+2060

Sadly, the no-break space isn’t enough. There are plenty of other places where line breaks are allowed: UAX 14 actually specifies the list of places where breaking is prohibited, and everywhere else is fair game.

For instance, breaking after a hyphen is normally allowed, so if you had a cute little face kaomoji like this: o-o, it would be legal to break after the central hyphen.

o-o ← what a cute little face

o-
o ← this is less cute

However, there’s another special character, the Word Joiner (or WJ; codepoint U+2060), which is defined as “a zero-width non-breaking space”. The WJ serves as glue (LB11), just like the no-break space, but is invisible.
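To make the o-o case concrete, here is a small Python sketch that glues the face by hand with a single Word Joiner after the hyphen. It uses only the standard unicodedata module, and the variable names are illustrative.

```python
import unicodedata

WJ = "\u2060"  # WORD JOINER: zero width, forbids a break on either side

# Glue the only breakable spot in "o-o": the position just after the hyphen.
plain = "o-o"
glued = "o-" + WJ + "o"

# The glued string has one extra codepoint, even though it renders the same.
assert len(glued) == len(plain) + 1
assert glued.replace(WJ, "") == plain

# The WJ is a format character (general category Cf), not a visible symbol.
print(unicodedata.name(WJ), unicodedata.category(WJ))
```

Because the extra codepoint is invisible, string comparisons and length checks are the easiest way to confirm it is really there.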

The recipe:

With these two ingredients, the NBSP and the WJ, we’re almost there! The obvious first idea would be to replace every space with an NBSP, and put a WJ between every other pair of codepoints. Sadly this approach doesn’t work, as it inserts WJs inside character sequences that are supposed to be combined together, such as accented characters. Many kaomoji rely on these combining sequences, and in many cases introducing the WJ causes them to break apart, for instance turning 凸⁠(ఠ్ఠ⁠皿⁠ఠ్ఠ ) into 凸⁠(⁠ఠ⁠్⁠ఠ⁠皿⁠ఠ⁠్⁠ఠ ).

Thankfully, line-breaking inside these combining sequences is already prohibited (LB9), so we don’t need to introduce a WJ there anyway. The approach we’ll take is to glue only those string locations where breaking would otherwise be allowed.
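To see where the combining marks sit inside one of those eyes, the following Python sketch lists the codepoints whose Unicode general category is a mark. This is only an approximation of the line-breaking classes that LB9 actually uses (general category is an assumption made here for simplicity), and the helper name combining_marks is invented for this illustration.

```python
import unicodedata


def combining_marks(s):
    """Return (index, U+XXXX) for every codepoint whose general category
    starts with "M" (Mn, Mc or Me).

    Hypothetical helper. General category is used as a rough stand-in for
    the combining-mark line-breaking class, not as an exact equivalent.
    """
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(s)
        if unicodedata.category(ch).startswith("M")
    ]


# One "eye" from the example: TELUGU LETTER TTHA, TELUGU SIGN VIRAMA, TTHA.
eye = "\u0C20\u0C4D\u0C20"

# Only the middle codepoint is a mark. Putting a WJ on either side of it
# is what splits the rendered eye apart.
print(combining_marks(eye))
```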

To find those breakable locations, we’ll need an implementation of the full line-breaking algorithm defined in UAX 14. This is pretty complex, but luckily there are open-source implementations for many programming languages. In Python we can use the uniseg library.

Putting this all together, we get a very nice and compact algorithm, with most of the subtlety hidden away inside uniseg.

from uniseg import linebreak


def make_unbreakable(s):
    """Python function to take an input string and "glue" it together with
    NBSP and WJ characters so that it doesn't break across lines.
    Designed for use with kaomoji."""
    # First replace every space character with a NBSP
    s = s.replace(u' ', u'\u00A0')
    # now break the string up around the breakable locations
    chunks = list(linebreak.line_break_units(s))
    # join these chunks together with the word-joiner character
    return u'\u2060'.join(chunks)

Interesting note: this algorithm will sometimes put a WJ before an NBSP. Although this may seem redundant, there are some cases (LB12a) where the character before an NBSP still allows for a line break, such as after any of the “break-after” class of characters, with fun entries like “two dots over one dot punctuation” U+2E2A (⸪).
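Since both kinds of glue are invisible or look like ordinary spaces, it can help to dump a string codepoint by codepoint when checking where a WJ or NBSP ended up. A minimal sketch using only the standard library follows. The helper name describe is made up for this illustration.

```python
import unicodedata


def describe(s):
    """List each codepoint in s as "U+XXXX NAME" so invisible glue shows up.

    Hypothetical debugging helper. Codepoints without a Unicode name are
    labelled "<unnamed>".
    """
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


# A break-after character followed by a WJ and then an NBSP, the
# "redundant-looking" combination mentioned above.
sample = "\u2E2A\u2060\u00A0"
for line in describe(sample):
    print(line)
```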

And that’s it! With this simple code snippet, you can start building your own unbreakable kaomoji, or of course just use the ones we’ve built into Dango. You should dig into the Unicode standard; it’s surprisingly interesting, not only to help you understand your computer, but to help you understand the complexities of written language.