I saw Moriel Schottlender give a talk about this topic at linux.conf.au 2016 in Geelong last month and asked her to submit an article about it. When you watch her talk video and read her article, you'll see why I couldn't take proper notes to write about the talk, so I'm glad she contributed her story directly to us instead. —Rikki Endsley

English is written left to right. Hebrew is written right to left. We know that. Browsers—for the most part—know that too, just like they know that the default directionality of a web page is left-to-right (LTR), and that if there is a setting that explicitly defines the direction to right-to-left, the page should flip like a mirror. Browsers are smart like that. Mostly.

But even browsers have problems when deciding what to do when languages are mixed up, and that, my friends, is a recipe for really weird issues when typing and viewing bidirectional text.

Before I delve into some interesting examples of mixed-up directionality problems, I should first go over how browsers consider directionality at all.

I already said that English is recognized as a left-to-right (LTR) language, and Hebrew, Arabic, Urdu, and some others as right-to-left (RTL) languages. These are fairly clear, and if you type a string that consists of only one of these languages, the situation is more or less okay (but I'll go over some issues with that later).

But not all characters in strings are equal.

Hebrew and English (and a number of other languages) are of the "strong" directionality types, the ones that not only have a direction but also affect their surroundings. Some characters have "weak" directionality: although they have a direction of their own, they don't affect the characters around them. And some characters are merely neutral, which means they get their directionality from their surroundings. Oh, and there are also some characters that may (and do) flip around visually depending on the text they're in.

Don't worry. I'm going to explain eeeeeeeeeeeeeeverything. Well, I'm going to try, so just keep reading.

Unicode, which is the encoding system most common online, defines character type directionality for groups of characters as either strong, weak, or neutral. These types control how these characters are presented inside a string.
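
If you want to poke at this yourself, here is a minimal sketch in Python. It leans on the standard library's `unicodedata.bidirectional()`, which returns the bidi class Unicode assigns to a character. The grouping into strong, weak, and neutral follows my reading of the Unicode bidirectional algorithm spec (UAX #9), and the helper name `bidi_strength` is made up for this illustration.

```python
import unicodedata

# Bidi classes grouped the way this article groups them.
# Grouping taken from the Unicode bidi spec (UAX #9); treat it as a sketch.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_strength(ch: str) -> str:
    """Return 'strong', 'weak', 'neutral', or 'other' for a single character.

    Hypothetical helper for illustration only.
    """
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    # Explicit formatting characters (LRE, RLO, ...) and unassigned code points.
    return "other"


# Latin H, Hebrew shin, a digit, a colon, a space, an open parenthesis.
for ch in ["H", "\u05e9", "7", ":", " ", "("]:
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>2} {bidi_strength(ch)}")
```

The letters should come back as strong, the digit and the colon as weak, and the space and the parenthesis as neutral.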

In the early days of the Internet (way, way back, when dinosaurs roamed the earth and half of you who are reading this post were probably in diapers), the Internet assumed pretty much everything was left-to-right.

I remember building web pages in raw HTML that most of us would cringe at today. There were no sites, really—only a collection of static HTML pages that, more often than not, included horrendous tags such as <blink> and <marquee> and featured pages in which one-font-served-all and the backgrounds were tiled. Ah, the good ol' days.

In those days, Hebrew was, in fact, typed backwards. If I wanted to write the Hebrew word "שלום", which starts with the Hebrew letter "ש", I would have to type it backwards, starting with the letter "ם", and produce "םולש", because the letters would appear sequentially from left to right. This might be doable when typing one or two words, but if you had an entire paragraph or an entire article, it could get annoying fast.

There were several tools you could download in those ancient days that would take your text and flip it. 'Cause that's how we rolled back then.
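
I don't have the source of any of those tools, so what follows is only a guess at the core idea, sketched in Python: take the text in the order you'd naturally type it and reverse it, so that a renderer that only knows left-to-right lays it out correctly. The function name is made up, and the sketch ignores everything that made the real job painful (line wrapping, embedded numbers, combining marks).

```python
def flip_for_ltr_only_display(logical: str) -> str:
    """Reverse a purely right-to-left string for an LTR-only renderer.

    Hypothetical helper; a naive code-point reversal, nothing more.
    """
    return logical[::-1]


# The word from the example above, written with escapes to keep the order unambiguous:
# shin, lamed, vav, final mem.
shalom = "\u05e9\u05dc\u05d5\u05dd"
flipped = flip_for_ltr_only_display(shalom)

# The flipped string starts with the final mem, exactly what we had to type by hand.
print(flipped == "\u05dd\u05d5\u05dc\u05e9")  # True
```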

Luckily, Unicode came in and defined directionality, and although Unicode still has problems, RTL users can at least type their language normally, rather than learn to write backwards. That helps.

Strong types

Strong types are character sets that have explicit directionality. Hebrew is right-to-left, always. English is left-to-right, always. When I type in either of those character sets, my characters appear in sequence, one after the other, according to the directionality. This is how the word "Hello" appears from left to right, whereas the word "שלום" appears from right to left.

Strong types also set the directionality of the space they're in, meaning that if I inserted any characters that have weak or neutral directionality in the middle of the sentence you're reading now (and I have already done that), they will assume the direction of the strongly typed string—in this case, English. So, strong type isn't just about the character itself, but also its surroundings.

Weak types

Weak types are fun. These are characters that might have a direction of their own, but that direction doesn't affect their surroundings and may be adjusted based on the surrounding text. In this group are characters such as numbers, plus and minus signs, the colon, comma, and period, and some control characters.

According to the Unicode bidirectionality algorithm specifications, weak types resolve according to the previous characters.
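
To make "resolve according to the previous characters" a little more concrete, here is a rough Python sketch of how a European digit gets resolved, based on my reading of two of the spec's weak-type rules (W2 and W7 in UAX #9): look backward for the nearest strong character and let it decide. The function name and the `base` parameter are mine, and the sketch assumes a single paragraph with no explicit embeddings, so please don't mistake it for the full algorithm.

```python
import unicodedata


def resolve_european_digit(text: str, i: int, base: str = "L") -> str:
    """Resolve the bidi class of the European digit at text[i].

    Simplified sketch of rules W2 and W7: search backward for the nearest
    strong character; if none is found, fall back to the base direction.
    """
    if unicodedata.bidirectional(text[i]) != "EN":
        raise ValueError("text[i] is not a European digit")

    previous_strong = base
    for ch in reversed(text[:i]):
        cls = unicodedata.bidirectional(ch)
        if cls in ("L", "R", "AL"):
            previous_strong = cls
            break

    if previous_strong == "AL":
        return "AN"  # after Arabic letters, the digit behaves like an Arabic number
    if previous_strong == "L":
        return "L"   # after Latin letters, the digit simply joins the LTR text
    return "EN"      # after Hebrew letters, it stays a number (shown LTR inside RTL)


print(resolve_european_digit("abc 123", 4))
print(resolve_european_digit("\u05e9\u05dc\u05d5\u05dd 123", 5))
print(resolve_european_digit("\u0639\u0631\u0628 123", 4))
```

The same digit "1" ends up with three different classes here, purely because of what came before it.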

Neutral types

Neutral types are the funnest. Neutral characters are character types that can be either right-to-left or left-to-right, so they completely depend on what string surrounds them. These include things such as new-line characters, tabs, and white-space.

According to the Unicode bidirectionality algorithm specifications, neutral types resolve their directionality according to the surrounding text.
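
Here is a simplified Python sketch of that idea, loosely modeled on the spec's rules for neutrals (N1 and N2 in UAX #9): if the nearest strong characters on both sides agree, the neutral takes their direction; otherwise it falls back to the base direction of the paragraph. The function name and the `base` parameter are assumptions of mine, and the sketch skips over numbers when looking for neighbors, which the real algorithm does not do.

```python
import unicodedata

NEUTRAL_CLASSES = {"B", "S", "WS", "ON"}


def resolve_neutral(text: str, i: int, base: str = "L") -> str:
    """Return 'L' or 'R' for the neutral character at text[i].

    Simplified sketch: only strong characters count as neighbors, and the
    edges of the text count as having the base direction.
    """
    if unicodedata.bidirectional(text[i]) not in NEUTRAL_CLASSES:
        raise ValueError("text[i] is not a neutral character")

    def nearest_strong(chars) -> str:
        for ch in chars:
            cls = unicodedata.bidirectional(ch)
            if cls == "L":
                return "L"
            if cls in ("R", "AL"):
                return "R"
        return base  # ran off the edge of the text

    before = nearest_strong(reversed(text[:i]))
    after = nearest_strong(text[i + 1:])
    return before if before == after else base


hebrew_word = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem

# A space between two Hebrew words follows the Hebrew around it.
print(resolve_neutral(hebrew_word + " " + hebrew_word, 4))

# A trailing "!" has Hebrew on one side only, so the base direction decides.
print(resolve_neutral(hebrew_word + "!", 4, base="L"))
print(resolve_neutral(hebrew_word + "!", 4, base="R"))
```

That last pair is the classic trailing-punctuation surprise: the same exclamation mark resolves to LTR in a left-to-right paragraph and to RTL in a right-to-left one.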

Implicit level types: When what you type is not quite what you get

So we have strong types, weak types, and neutral types, but that's not where our directionality double-take ends. In fact, the real doozies are characters that are resolved differently (as in, they take literally different shapes) in either RTL or LTR.

Yes, you read that right: They actually literally and quite visibly look different when written inside an LTR string versus inside an RTL string.

The best examples for this are parentheses and (my personal best friend) the bracket. These symbols are, in fact, icons that already represent direction. The key on your keyboard that has "(" on it is not quite that, but rather a symbol for "open parenthesis." In English (which is left-to-right) the symbol is naturally ( to open parentheses, and ) to close them. But in Hebrew and Arabic and the other RTL languages, the "open parenthesis" symbol is the reverse ), because the string is right to left. So this symbol would appear on your screen as either ( or ) depending on where you typed it.

I know, right?
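
Unicode keeps track of which characters behave this way through a "mirrored" property, and Python's standard library exposes it as `unicodedata.mirrored()`. A small sketch (the wrapper name `is_mirrored` is just for illustration):

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """True if Unicode marks the character as mirrored in bidirectional text."""
    return unicodedata.mirrored(ch) == 1


# Parentheses, brackets, and angle signs flip; a letter and "!" do not.
for ch in "()[]<>a!":
    print(repr(ch), is_mirrored(ch))
```

Note that this only tells you whether a character can flip. Which shape you actually see still depends on the direction the character resolves to in its surrounding text.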

Mishmashing both ways

In general, if one uses only one direction in a document (specifically online), the problems are not as noticeable, because the strongly typed text surrounds all other weak and implicit-level character types, making them its own type by default.

The issues come up when we have to mix languages and directions, or use an RTL language inside a block that is meant for LTR. This happens a lot online: if there is no explicit dir="rtl" anywhere in the HTML document, the document defaults to LTR directionality. The directionality of the page (set by using dir="rtl" or dir="ltr", or by not using the dir attribute at all and relying on its default fallback to LTR) is considered to explicitly set the directionality of the expected text. So, any characters of ambiguous directionality will take on the direction that was set by that attribute.

If, say, I try to type an RTL language inside a textbox in a page that has dir='ltr', I can run into a lot of annoying problems with punctuation, the positions of segments of the sentence, and mixing languages of a strong type. The same happens the other way around, if I try to type an LTR language (say, English) inside an RTL-set textbox.

It can get so confusing that, quite often, as I try to figure out how to type LTR text into an RTL box and see how my text actually organizes itself, my mind is pretty much blown.

The good, the bad, and the ugly

So, obviously, Unicode was a huge improvement over the reverse-typing (and the need to use multiple individual fonts) that existed before it. Browsers tend to follow the Unicode rules (though apps that do their own rendering sometimes don't, but that's a different issue). And this Unicode directionality algorithm gives us a lot of really Good Things to work with when typing different directions, but it also has Bad Things, and occasionally, even really Ugly Things.

Good things

There are, indeed, a bunch of good things that happen due to Unicode's bidirectionality algorithm. As I've already mentioned, the fact that RTL users can type their language normally (and not backwards) is already a good thing (and I know from experience, because I used the system when it didn't have that nice feature).

Another benefit of the bidirectionality algorithm is that we can use numbers (which are weakly typed LTR) inside RTL text. So, for instance, consider this text:

ניפגש ב09:35 בחוף הים

Literally, this means "we will meet at 09:35 at the beach." Notice, though, that even without any directionality fixes, the numbers 09 and 35 are left-to-right as they should be, because that's how numbers are read—but I didn't really need to manually reverse my typing when I wrote this sentence. The browser did it for me.
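
If you'd like to see what is actually stored versus what is displayed, here is a short Python sketch that splits that sentence into runs of identical bidi class. The helper name `bidi_runs` is my own; the only library calls are `unicodedata.bidirectional()` and `itertools.groupby()`.

```python
import unicodedata
from itertools import groupby

# The sentence from above, in logical (typing) order.
sentence = "ניפגש ב09:35 בחוף הים"


def bidi_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a bidi class (illustrative helper)."""
    return [
        (cls, "".join(chars))
        for cls, chars in groupby(text, key=unicodedata.bidirectional)
    ]


runs = bidi_runs(sentence)

# Print only the class names, one per run.
print([cls for cls, _ in runs])

# The time is stored exactly as typed; nothing was reversed by hand.
print("09:35" in sentence)
```

The Hebrew words should show up as runs of class R, the digits as EN, and the colon between them as CS, which is the weak "separator" class that lets 09:35 hold together as one left-to-right chunk.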

Here's a nice exercise, though. Select that sentence. When you do, you can see exactly what piece has what directionality. Which leads me to ...

Bad things

Selections

Selections are a major part of the problem of bidirectional text. As you can see from the example of the "good thing" (that I don't need to reverse typing), there is also a bad side, which is how to select my text. Selection can be logical or visual. This is also true of cursor movement, which I will go over in a second.

