If you don’t understand the following line of code and would like to, keep reading:

'\uD83D\uDCA9' === '\u{1F4A9}' && '\u{1F4A9}' === '💩' // true

Why does this expression return true?

Because in JavaScript, an escape sequence inside a string literal that begins with:

\u

is strictly equal (===) to the character it represents. In this case, the following 3 expressions are all strictly equal to one another:

'\uD83D\uDCA9' '\u{1F4A9}' '💩'

\u is used to represent a Unicode character by its code point.

Unicode: a standard, in use since the late 80s, that unifies the many different character encodings.

Code point: the number that identifies a given character in Unicode.

But why are there two different escape notations for

'💩'

?

Because the first one is the ES5 notation and the second one is the ES6 notation (the one that uses curly brackets {}).

ES5: ECMAScript version 5, released in 2009. ECMAScript is the specification that defines each JavaScript version.

ES6: ECMAScript version 6, released in 2015.

Why does the ES5 notation keep working? Because every new ECMAScript version follows a fundamental principle: “don’t break the web”. New versions must be backward compatible so that the millions of sites written against older versions of JavaScript are not harmed.

To dig deeper into this particular line of code, we need to understand these points:

1. ES6 requires Unicode version 5.1 or later; ES5 requires version 3 or later.

2. Unicode is made up of 17 planes, each with 65,536 code points. That number is the count of possible combinations of 4 hexadecimal digits (16⁴ = 65,536). Four hexadecimal digits fit in exactly 2 bytes, which are 16 bits, so the number can also be calculated as 2¹⁶ = 65,536.

3. The total number of code points is 1,114,112: the 65,536 code points in a plane multiplied by the 17 planes (65,536 × 17 = 1,114,112).

4. The first plane is called the BMP, for “Basic Multilingual Plane”. The characters in this plane can be represented in many ways. For example, the @ symbol:

'@' === '\x40' &&

'@' === '\u0040' &&

'@' === '\u{40}' &&

'@' === '\u{040}' &&

'@' === '\u{0040}' &&

'@' === '\u{00040}' &&

'@' === '\u{000040}' &&

'@' === '\u{0000040}'

// true
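The plane arithmetic from points 2 and 3 can be checked directly in a console. This is just a small sketch; the variable names are mine and are not part of any API.

```javascript
// Code points per plane: four hexadecimal digits, which is 16 bits.
const codePointsPerPlane = 16 ** 4; // same value as 2 ** 16
const planeCount = 17;
const totalCodePoints = codePointsPerPlane * planeCount;

// Code points are numbered from 0, so the highest one is the total minus 1.
const highestCodePoint = totalCodePoints - 1;

console.log(codePointsPerPlane === 2 ** 16); // true
console.log(totalCodePoints); // 1114112
console.log(highestCodePoint.toString(16).toUpperCase()); // 10FFFF
```

That last value is why the largest escape you can write is '\u{10FFFF}'.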

5. The remaining 16 planes are called supplementary planes or astral planes. To represent a character outside the first plane (BMP), ES5 requires converting the code point to a surrogate pair, as in the expression

'\uD83D\uDCA9'

that concerns us. This conversion is fairly complex; you can see the details here. Reducing that complexity is one of the improvements of the ES6 code point syntax.
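To give an idea of what that conversion involves, here is a minimal sketch of the standard UTF-16 surrogate pair calculation. The helper name toSurrogatePair is hypothetical, I made it up for illustration.

```javascript
// Hypothetical helper (not a built-in): splits an astral code point
// into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a code point between U+10000 and U+10FFFF');
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9);
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(String.fromCharCode(high, low) === '\uD83D\uDCA9'); // true
```

With the ES6 syntax none of this is needed: you write the code point itself between the curly brackets.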

6. In ES6, with the curly bracket syntax, as in the symbol we are looking at

'\u{1F4A9}'

if the hexadecimal number between the curly brackets {} has 4 or fewer digits (ignoring leading zeros), it is a code point from the first plane, the BMP. If it has more digits, the digits to the left of the last 4 indicate the plane. In this expression, the digit 1 points to plane 1. Leading zeros don’t change the value, so these expressions are equivalent:

'\u{1F4A9}'=== '💩'

&& '\u{01F4A9}'=== '💩'

&& '\u{001F4A9}'=== '💩'

&& '\u{0001F4A9}'=== '💩'

//true
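Since each plane holds 0x10000 (65,536) code points, the plane can also be computed from the code point. A small sketch, where planeOf is a hypothetical helper of mine:

```javascript
// Hypothetical helper (not a built-in): the plane is the code point
// divided by the plane size, rounded down.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x40));     // 0, the BMP
console.log(planeOf(0x1F4A9));  // 1
console.log(planeOf(0x10FFFF)); // 16, the last plane
```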

7. Unicode escapes can also be used in identifiers, such as variable names:

name = 'John';

\u006E\u0061\u006D\u0065 += ' Lennon';

console.log(name); //'John Lennon'

This works because both spellings refer to the same variable, so this comparison is also true:

name === \u006E\u0061\u006D\u0065 // true

However, it’s better not to abuse this possibility. Prefer simple variable names limited to ASCII, which has only 128 characters (256 in its extended versions).

D3.js, the famous library for building graphics and visualizations from data, ran into an instructive bug by overusing Unicode in variable names such as π, φ and λ. This worked fine when users declared UTF-8 in a meta tag

<meta charset="UTF-8">

or in the script tag

<script charset="utf-8" src="d3.js"></script>

But it was a nightmare for people not using UTF-8.

I just checked the D3.js code at https://d3js.org/d3.v5.js and they no longer use symbols like π, but more prosaic and safer names:

var pi = Math.PI

8. ES6 defined two new methods to play with Unicode:

Click on each one to see the documentation.
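Two ES6 additions that fit this description are String.fromCodePoint and String.prototype.codePointAt. I am assuming these are the two methods meant here. A short sketch of how they behave with our symbol:

```javascript
// Build the character from its code point (ES6).
const poo = String.fromCodePoint(0x1F4A9);
console.log(poo === '\u{1F4A9}'); // true

// Read the full code point back (ES6).
console.log(poo.codePointAt(0).toString(16)); // 1f4a9

// The older charCodeAt only sees the first half of the surrogate pair.
console.log(poo.charCodeAt(0).toString(16)); // d83d

// Index 1 lands in the middle of the pair, so only the low surrogate comes back.
console.log(poo.codePointAt(1).toString(16)); // dca9
```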

9. Finally, it’s important to keep in mind that strings containing characters from the astral (supplementary) planes behave differently from “normal” strings. For example, regular expressions don’t work as expected, and neither does the length of the string. Measured with the length property, it usually gives a counterintuitive result:

'💩'.length // 2

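To count characters in a more human way, we can transform the string into an array. Note that split() with no arguments won’t do, because it always returns a one-element array containing the whole string. The spread syntax (or Array.from) iterates by code point instead:

[...'💩'].length // 1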
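Here is a short sketch of both problems mentioned in this point, the length and the regular expressions, together with the ES6 tools that help: Array.from, which iterates a string by code point, and the u flag for regular expressions. The sample string is my own.

```javascript
const text = 'I \u{1F4A9} JS'; // 'I 💩 JS'

// length counts UTF-16 code units, so the astral character counts twice.
console.log(text.length); // 7

// Array.from iterates by code point.
console.log(Array.from(text).length); // 6

// Without the u flag, the dot matches a single code unit.
console.log(/^.$/.test('\u{1F4A9}')); // false

// With the u flag, the dot matches a whole code point.
console.log(/^.$/u.test('\u{1F4A9}')); // true
```

Keep in mind that this counts code points, so a character built from several code points (for example an emoji followed by a variation selector) will still count as more than one.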

This article was meant to be a couple of paragraphs about Unicode in JavaScript for a Slack channel of Colombian developers (Colombiadev), but it grew enough that I preferred to write it down here.

REFERENCES: