About Unicode

On one good day, mankind invented electronic computers. They were adopted as a useful tool by various people all over the world. Everyone was happy.

But then, these people wanted to send their work, stories and funny jokes to each other in the form of text. And unfortunately, these silly people never managed to agree on using one language and writing system. That's alright, but they also did not agree on using one encoding format for their text.

Legacy Encodings

You see, computers don't think in letters and lines, but in bits, bytes and numbers. So we humans had to come up with a way to represent text as a bunch of numbers… Easy! Just map each symbol from your writing system of choice to a number that a computer can handle. For example, in the text you are reading right now, the lowercase Latin alphabet, a to z, is carried by the numbers 97 to 122 inclusive. But not everyone with a computer reads the Latin script. Other people use scripts like Cyrillic, Greek, Arabic, Han, Katakana and Kanji, and they used the same number trick to encode their text. For example, the Cyrillic letter Д (De) is encoded as the number 196 in the Windows-1251 encoding, while in extended ASCII (code page 437) that same number is '─', a horizontal line used to draw boxes.
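You can see the same byte meaning different things under different encodings directly in Python. A small sketch, using the byte value and encodings from the paragraph above:

```python
# One and the same byte value, 196, decoded under two legacy encodings.
byte = bytes([196])

# In Windows-1251 (Cyrillic), byte 196 is the letter De.
print(byte.decode("cp1251"))  # Д

# In extended ASCII (code page 437), the same byte is a box-drawing line.
print(byte.decode("cp437"))   # ─

# And the number trick for the Latin alphabet: a..z map to 97..122.
print(ord("a"), ord("z"))     # 97 122
```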

So before opening a text file sent by your friend on the other side of the world, you first have to know which encoding they used to write that file. You can guess, but there are a lot of these old encodings, almost 60! So when you open the file with the wrong one, it might look something like this:

rcÁ=Óì·ÇX[ésô´-åøJ¦@e¹ýD0ý["§¬Æ¢eSSÔQ¦cûcWUnÛÇ/k5©¶§.ÖÒLÝSTtaöD:<%d·%ÄÜI/y¡þ,Ô± f.ÉIlÃ»Ò£èfm mÎãSìnÕÏiàÇvø[å]/_~5L5°ô¬6×áÇÉL=É:²÷ÆTçß¨ìUEAwÜ[Í4±© Ñ³PEbÌ1%m"f_½ O`_¥Hð¢ÍA.©Ütæì²²_yèä'4ùuý¬ÇÿtMF-à9Òô@×I÷"ÊtÂî5&Ã½¥Zásq®^mTÕ{§bÀÜ´øFÁ³q0Ï¢®](é4

The correct term for this kind of garbled text is mojibake.
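You can produce mojibake yourself by decoding bytes with the wrong encoding. A minimal sketch in Python (the sample word is my own choice):

```python
# Write some Cyrillic text as UTF-8 bytes...
data = "Привет".encode("utf-8")

# ...then decode those bytes as if they were Windows-1252, a Latin encoding.
mojibake = data.decode("cp1252")
print(mojibake)  # ÐŸÑ€Ð¸Ð²ÐµÑ‚

# The bytes themselves are unharmed: decoding them correctly recovers the text.
print(mojibake.encode("cp1252").decode("utf-8"))  # Привет
```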

UTF-8

So how do we fix the problem of having all these different encodings? Well, the answer is quite simple: just create a single big encoding that contains all the scripts and symbols. This is Unicode. Unicode specifies ranges of so-called code points, or characters. Unicode itself is not the actual encoding; that job falls to one of the formats used to encode Unicode code points. The most commonly used Unicode format is UTF-8. There are other formats like UTF-16 and UTF-32, but UTF-8 is the most awesome format for Unicode because:

It can store anything.

Old encodings required the entire document to be written using the same encoding, and thus the same writing system, preventing the user from mixing writing systems in a single document. With UTF-8, users can.

It uses a variable character length.

Unicode has room for 1,114,112 code points (everything up to U+10FFFF). If every character were stored as a fixed four bytes, as UTF-32 does, documents and websites would become up to four times as large. UTF-8 instead uses just one byte for most Latin characters and up to four bytes for less common characters.

It is backwards compatible with ASCII.

The first 128 code points encoded by UTF-8 are byte-for-byte identical to ASCII. ASCII was a widely used format before UTF-8 became popular. By being backwards compatible, UTF-8 programs can handle files encoded in ASCII without having to re-encode them.
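Both the variable character length and the ASCII compatibility are easy to check in Python. A small sketch; the sample characters are my own picks:

```python
# Variable length: UTF-8 spends between one and four bytes per code point.
for ch in ["a", "Д", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# a 1, Д 2, € 3, 😀 4

# ASCII compatibility: for the first 128 code points the bytes are identical,
# so a pure-ASCII file is already a valid UTF-8 file.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
```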

You may still encounter documents or websites that do not use UTF-8, most likely websites written in a non-Latin script, such as Mandarin or Russian. Why? Because the website owners don't want to blow their money on bandwidth costs. Sounds strange? Well, let me explain. Documents with a lot of Latin characters are small, since the most common characters are only one byte big in UTF-8. Other writing systems land in Unicode ranges where a single character takes two, three or even four bytes, so those documents are simply bigger than documents written in a language that uses Latin characters. An example of this phenomenon is the Russian social website vk.com. This website uses the Windows-1251 encoding, since it encodes the Cyrillic writing system with one byte per character, saving bandwidth.
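The bandwidth argument can be made concrete with a quick comparison (a sketch; the sample sentence is my own):

```python
# The same Russian text in the legacy one-byte encoding and in UTF-8.
text = "Привет, мир"

in_cp1251 = text.encode("cp1251")  # one byte per character
in_utf8 = text.encode("utf-8")     # two bytes per Cyrillic character

print(len(in_cp1251))  # 11 bytes
print(len(in_utf8))    # 20 bytes
```

For Cyrillic-heavy text, UTF-8 is nearly twice the size of Windows-1251; only the comma and the space stay one byte wide.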

Nowadays, UTF-8 is the most commonly used text encoding on the internet. It is also the backbone of this website: without Unicode, Lenny and all the other dongers would probably be limited to the ASCII character range.