From 🅰️ to 💤

Computers store all data in binary: a sequence of 0s and 1s. (Apologies if I’m teaching a 👵🏻 to suck 🥚!)

Due to their strictly mathematical nature, numbers have a few universally agreed standards for their binary representation. There are some subtle variations, such as big/little-endianness, and a few different approaches to storing negative or fractional values (along with some complications), but broadly speaking these standardised formats are settled.
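As a quick illustration, here’s a minimal Ruby sketch (assuming nothing beyond a standard Ruby interpreter) of viewing a number’s binary digits, and of how endianness changes its byte order:

```ruby
# The same integer, written out as binary digits:
42.to_s(2)              # => "101010"
0b101010                # => 42

# Endianness: the 16-bit number 258 (0x0102), packed both ways:
[258].pack("S>").bytes  # => [1, 2]  (big-endian: most significant byte first)
[258].pack("S<").bytes  # => [2, 1]  (little-endian: least significant byte first)
```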

Storing language in binary, however, is a whole different ⚾ 🏟️. One of the earliest, and best-known, conversions from numbers → letters is the following:

This “ASCII” table (now preferably referred to as US-ASCII), created in 1963, was an early form of character encoding. It was designed to suit the needs of its original users; in particular, it was designed for:

The English language…

…with data entry via typewriters…

…for use in telegraphs.

As such, it includes 33 non-printing control codes, most of which are now obsolete.
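In Ruby, you can poke at this numbers → letters mapping directly; a quick sketch, using only the standard String and Integer methods:

```ruby
"A".ord   # => 65  ("A" sits at position 65 in the US-ASCII table)
97.chr    # => "a"
"\a".ord  # => 7   (the now-obsolete "bell" control code)
"\r".ord  # => 13  ("carriage return")
```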

Today, it’s little more than an odd quirk of history that some of these characters, including “bell” (\a), “vertical tab” (\v) and “carriage return” (\r), are still defined within this standard character set. (And this history is why Windows operating systems still use two characters, \r\n, to define new lines; mimicking the physical action of typewriters!)

Inside a Windows machine
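To see those typewriter leftovers in action, here’s a small Ruby sketch (again, nothing assumed beyond a standard interpreter):

```ruby
windows_text = "line one\r\nline two"  # Windows-style: \r\n
unix_text    = "line one\nline two"    # Unix-style: \n only

windows_text.lines    # => ["line one\r\n", "line two"]
unix_text.lines       # => ["line one\n", "line two"]

# Ruby's chomp strips either style of line ending:
"line one\r\n".chomp  # => "line one"
```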

But alas, this table is insufficient for the wider 🌏. What about languages that…

Over the years, many new character encodings were created to solve these problems, typically by creating “multi-byte encodings”, i.e. using two 8-bit numbers to represent each character rather than one, since a single byte’s 2⁸ = 256 slots simply aren’t big enough to store all the necessary characters for most languages.
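For example, here’s a quick Ruby sketch (it assumes your Ruby build ships with the Shift_JIS encoding, as standard builds do) showing a Japanese character that needs two bytes in one such legacy encoding:

```ruby
# あ (hiragana "a") in the legacy Japanese Shift_JIS encoding:
"あ".encode("Shift_JIS").bytes  # => [130, 160]  (two 8-bit numbers, not one)

# ...while a plain ASCII letter still fits in a single byte:
"A".encode("Shift_JIS").bytes   # => [65]
```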

It was a mess. Eventually, someone decided to f̶o̶r̶g̶e̶ ̶a̶ ̶r̶i̶n̶g̶ ̶t̶o̶ ̶b̶i̶n̶d̶ ̶t̶h̶e̶m̶ ̶a̶l̶l̶ create one encoding standard to unify them. Enter: Unicode!

Unicode (more specifically, its UTF-8 encoding) is a superset of US-ASCII. Without going into too much detail, UTF-8’s four-byte sequences can encode up to 2²¹ = 2,097,152 values, although Unicode itself caps this at 1,114,112 “available slots” (U+0000 through U+10FFFF) for characters to be defined.
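A short Ruby sketch makes the “superset” point concrete (assuming only Ruby’s default UTF-8 source encoding):

```ruby
"A".bytes  # => [65]             (identical to its US-ASCII value)
"€".bytes  # => [226, 130, 172]  (a three-byte UTF-8 sequence)
"€".ord    # => 8364             (its Unicode code point, U+20AC)
```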

The latest Unicode standard, at the time of writing, is version 12.1 (released May 2019; supported since Ruby 2.6.3), and defines 137,994 characters.

The current draft for version 13.0 (planned for release in March 2020) expands this to 143,859.

These characters include everything from Hebrew, to Thai, to Japanese, to (you guessed it) Emoji!
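And, to bring things back to the title, one last Ruby sketch showing that those two emoji are just Unicode code points like any other character:

```ruby
"🅰".ord.to_s(16)  # => "1f170"  (U+1F170, NEGATIVE SQUARED LATIN CAPITAL LETTER A)
"💤".ord.to_s(16)  # => "1f4a4"  (U+1F4A4, SLEEPING SYMBOL)
"\u{1F4A4}"        # => "💤"
```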