It was a great standard and became a foundation of everything that was about to come.

Going international

Despite having such an extensive character range, there was one serious issue with ASCII. It was designed with English alphabet in mind. And what about German ä, ë and the rest of the umlauts? Or Polish ź, ł, ę and all the others? Or any other European language which definitely uses some special characters?

That’s when ISO joins the game

By the way, did you know that ISO stands for International Organization for Standardization, which in no way seem to shorten to ISO — pretty ironic as far as dealing with standards go.

Anyway, ISO issued a group of standards named ISO/IEC 646. These standards explained how to adapt ASCII to regional needs. It indicated 10 ASCII characters that can be replaced with special, regional characters. There were also two characters dedicated for currency symbols. Now you could have used your local currency instead of a “$”. But good luck trying to fit all 18 Polish letters in there. Or imagine you are a software developer writing in C, which was a prevailing programming language back then, and trying to write some code.

That’s how C code looks like. Quite a lot unsupported characters here…

The thing is your language standard doesn’t support any of the following characters:

# { } [ ]

And it only got worse from here…

Welcome to the 80s

Do you know what happened in the eighties? Well, probably a lot of things, but one of them was 8-bit Personal Computers getting more and more popular. From now on we can call 8 bits a byte. Anyway, this one extra bit made PCs quite different from earlier 7-bit devices. And it was pretty convenient as far as text encoding goes. Every extra bit doubles the number of available characters. There were now 128 new empty spaces to fill. And computer manufacturers didn’t wait for any standard to arise. Each of them took initiative and invented their own encoding system. We saw (well, not me personally, but I’m sure some of us did) rise of encodings such as HP Roman, Mac OS Roman, and Code Page 437 — used by IBM and in DOS.

It desperately called for help.

ISO once again

The call for help was once again heard by the guys from ISO. And what they did was (most likely) say:

Enough of this! Here is the one and ultimate standard for character encoding called ISO-8859–1.

And it was great as long as you were using Western European languages. If you were more into Central European languages there was a standard for this as well, called ISO-8859–2. Actually, ISO-8859 was a whole group of standards all the way up to the number 16.

Some people say, there’s a relevant XKCD comic for everything. 927 — Standards is relevant here.

It looks like this whole thing with creating new standards was quite fun. And Microsoft didn’t want to miss on that, so they’ve joined the bandwagon as well. They created Windows-1250, which was almost like ISO-8859–2, but not quite. And Windows-1252, which resembled ISO-8859–1, but with some additions here and there. And it didn’t stop here. More and more encoding were created, often changing and remixing existing ones.

That looks like fun. It’s probably why Microsoft joined the bandwagon.

But let’s stop for a moment and ask ourselves one important question: “why?”

I mean there is an encoding which supports western languages, there is also an encoding for eastern ones. Why would you create some strange hybrids which support a few characters from here, a few from there and throw in a few entirely new ones? Imagine you have an encoding which supports French letters. You also have another one which supports Greek alphabet. Using only one of them (you can’t use two encodings in one file) write a text in French about Greek language.

Now you see why.

And the fun didn’t quite stop here

Up to this point, we’ve been mostly working with languages using Latin letters with some regional modifications. But as you may know, there is a bunch of languages that don’t do this —people from China, Japan and Korea were all using computers as well. But how were they supposed to encode tens of thousands of Chinese characters with only 256 characters available in an 8-bit encoding?

They all agreed (that’s something new as far as encodings go) that it’s necessary to use 16 bits (otherwise known as 2 bytes) to encode all those characters. It might not look like a big difference, but keep in mind that instead od 2⁸ = 256 possibilities, you now get 2¹⁶=65536 possible characters. 65 thousand characters were just enough for encoding Asian languages. But don’t be fooled. Agreeing on code length didn’t mean agreeing on one encoding standard. The same encoding frenzy that was happening in Europe and the USA was happening here.

Can we all agree to never repeat a “war of standards” like that one?

Unicode to the rescue

It was at this moment people saw the ultimate solution to all their problems… Or did they?

To resolve all encoding issues, there was an idea to create one global, universal characters map. Unlike previous ISO encodings, it would cover all existing languages. This idea resulted in creation of the Universal Character Set (created as a norm called ISO/IEC 10646). But as you might have guessed, it was not the only implementation of such idea. Anyway, the first thing they did was create a Basic Multilingual Plane. It was a map of most common Chinese characters and letters for all other languages.

But they didn’t only create a list of all those characters. They also suggested a way to encode them, which they called UCS-2. The Chinese language needs most of the mapped characters out there and they did fine with 2 bytes. For this reason (among many others) UCS-2 also used 2 bytes — 16 bit code length (hence 2 in name). But it quickly turned out that Basic Multilingual Plane was actually quite incomplete.

And it all started with sending electrical “beeps” two centuries ago

If 2 bytes are not enough for you to encode all the characters, what do you do? That’s right. You create a new encoding with a longer code. And so the idea for UCS-4 was born.

Birth and rule of the Unicode

To turn this idea into something real, IEEE suggested a character encoding which met UCS-4 requirements. But for some reasons, which are out of the scope of this article, their encoding was not accepted as a standard. That’s when the other player took the initiative.

This other player was Unicode. Unicode is an organization (Unicode Consortium) creating the Unicode Standard. Unicode Standard is more or less the same thing as Universal Character Set with some extra rules. Unicode had their own 16-bit encoding as well, and was creating a 32-bit one. It turned out, that Unicode Standard was easier to use than Universal Character Set. More and more companies were leaning towards the simpler alternative. At this point, ISO noticed what’s happening. Once again, many competing standards were starting to exist. To prevent this, ISO negotiated with Unicode. And miraculously, they both agreed on one common solution.

This way Unicode’s encodings, known as UTF (short for Unicode Transformation Format), became a standard way to encode texts in the modern world. There are a few variants like UTF-8, UTF-16 and UTF-32 (yup, those numbers are code length) which are all used nowadays. Each of them has its advantages and disadvantages, but all of them can support millions of characters, which should be enough for many years to come…

One Encoding to rule them all, One Encoding to find them, One Encoding to bring them all and in the darkness bind them all

That was a long way humanity had to come to decide on one universal way to send a text using electricity. I had to simplify a lot of things to make it fit in one article, but the main story remained intact. The Internet nowadays is a much happier place with almost 90% of all websites using a common standard for character encoding — UTF-8.

Woah, wait! But how do they fit millions of characters into an 8-bit UTF-8? Wasn’t 8-bit ASCII only supporting 256 characters?

Well, that’s a different story, which happens to be a second part of the one you’ve just read. We’ll get into technical details of Unicode there.

Read next:

If you enjoyed this post, please don’t forget to tap ❤! You can also follow us on Facebook, Twitter and LinkedIn.