Character Encoding in JavaScript

ASCII is one of the most popular encodings still in use today. It is a fixed-length encoding scheme that uses exactly 7 bits to encode each character. This encoding can only encode the letters of the English alphabet, digits, and some commonly used symbols like dash (-) and period (.).

However, JavaScript engines store string literals in the UTF-16 encoding format. UTF-16 is a variable-length encoding scheme built on 16-bit code units. UTF-16 can encode every character in the Unicode character set with 1 or 2 code units.

If you are coming from the article I mentioned before, then you know what a code unit is. The code unit is the building block of the character’s encoding representation. For UTF-16, the code unit is 16 bits. If a character does not fit in one code unit, it is encoded with two code units, making a total of 32 bits.

Some old JavaScript engines might use UCS-2 encoding. UCS-2 is a fixed-length encoding that uses exactly 16 bits to encode a character. Since UCS-2 and UTF-16 both use the Unicode character set, their encodings are identical for every character that fits in 16 bits.

Since UTF-16 can use 16 or 32 bits to encode a character, it can encode more characters than UCS-2, which only uses 16 bits of memory per character. The UCS-2 character encoding scheme is old and obsolete.

Newer JavaScript engines use the UTF-16 encoding scheme to encode characters in a string. As of ECMAScript 2015 (ES6), string literals are stored in UTF-16 encoding and have full UTF-16 support.

If you are coming from the article I mentioned before, then you know what a code point is. A code point is the integer value assigned to a character, which a program uses to identify that character. The code point is important because no matter how a character is encoded, this unique integer value will always point to the same character.

For ASCII encoding, since we have 7 bits to encode a character, a total of 128 (2⁷) characters can be encoded, with values from 0 to 127. Hence the code point of a character ranges from 0 to 127. The code point of character A is 65₁₀ or 41₁₆.

For UTF-16 encoding, we can encode a character in 16 bits or 32 bits. If you look at the UTF-16 encoding table, characters with a code point from 0₁₆ to FFFF₁₆ are represented with 1 code unit (16 bits) and characters with a code point from 10000₁₆ to 10FFFF₁₆ are represented with 2 code units (32 bits).

In the Unicode character set, the code point of character A is 65₁₀ or 41₁₆, and the code point of the character आ is 906₁₆. Since the code points of these characters fall in the 0₁₆ - FFFF₁₆ range, they can be encoded with just one code unit of UTF-16. Hence each of these characters takes only 16 bits of memory.
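To check these values yourself, here is a small sketch that reads the code point of each character using the built-in String.prototype.codePointAt() method (available since ES6). It only illustrates the numbers quoted above.

```javascript
// codePointAt(0) returns the code point of the character at index 0 as a number.
console.log( 'A'.codePointAt(0) ); // logs: 65

// toString(16) converts that number to its hexadecimal form.
console.log( 'A'.codePointAt(0).toString(16) ); // logs: 41

console.log( 'आ'.codePointAt(0).toString(16) ); // logs: 906
```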

However, characters with a code point greater than FFFF₁₆ need two code units. For example, the emoticon character 😊 (happy face) has the code point 1F60A₁₆. Hence it needs two code units, or 32 bits of memory, to encode.

When a character needs two code units, each code unit is called a surrogate code unit and together they are called a surrogate pair. For character 😊, the surrogate pair is D83D₁₆ DE0A₁₆.
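If you are curious where D83D₁₆ and DE0A₁₆ come from, the sketch below works through the standard UTF-16 arithmetic: subtract 10000₁₆ from the code point, then split the remaining 20 bits into two 10-bit halves. The function name toSurrogatePair is made up for this illustration and is not a built-in.

```javascript
// Hypothetical helper (not a built-in): splits a code point above FFFF
// into its high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Code point must be between 0x10000 and 0x10FFFF');
  }

  var offset = codePoint - 0x10000;       // a 20-bit value
  var high = 0xD800 + (offset >> 10);     // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits

  return [high, low];
}

var pair = toSurrogatePair(0x1F60A);

// Print each code unit as an uppercase hexadecimal number.
console.log(pair.map(function (unit) {
  return unit.toString(16).toUpperCase();
}).join(' ')); // logs: D83D DE0A
```

You rarely need to do this by hand, but it shows that a surrogate pair is just the code point rearranged into two 16-bit numbers.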

Unicode Escape

We can store any character in JavaScript using a string literal, that is, by putting a character or a sequence of characters in a single or double-quoted string. JavaScript will store this string in UTF-16 encoding for us.

However, we can also use the code point of a character to represent it. For this, we use the \u prefix followed by the hexadecimal value of each UTF-16 code unit that encodes the character. The \u prefix is called the Unicode Escape character.

A code unit written with the Unicode Escape character forms a Unicode Escape. Hence for characters encoded in a single code unit of UTF-16, a Unicode Escape looks like below.

'\uXXXX'

Here, XXXX is the character’s code point written as exactly 4 hexadecimal digits. In this case, the code unit is identical to the code point.

When a character is encoded with two code units, we need two Unicode Escapes, one for each surrogate code unit.

'\uYYYY\uXXXX'

💡 When one or more Unicode Escapes are put together, they are called a Unicode Character Escape Sequence.

⦿ — — — — ⦿

ASCII Charset

All the characters with a code point between 0₁₆ (0₁₀) and 7F₁₆ (127₁₀) belong to the ASCII character set. Since these characters are encoded with just one code unit of UTF-16, we need only one Unicode Escape to represent each of them.

For example, for character A with code point 41₁₆ (65₁₀), the code unit looks exactly like the code point. Hence, its code unit can be written as 0041₁₆. Its Unicode Escape is rather simple.

var characterA = '\u0041';

console.log( characterA ); // logs: A

This applies to all the characters from the ASCII character set.

If we want to represent characters from the ASCII and extended-ASCII (like ISO 8859–1) character sets only, we can use the \x prefix, which is called the hexadecimal escape character.

\x is followed by exactly 2 hexadecimal digits (a single byte), which give the code point of the character in the Unicode character set. Hence, the hexadecimal escape sequence for the characters A© is as below.

console.log( '\x41\xA9' ); // logs: A©
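As a quick sanity check, the sketch below compares the hexadecimal escape, the Unicode Escape, and the plain character. All three spellings should produce the same string, since they are only different ways of writing the same code unit in source code.

```javascript
// The same character A written three ways.
console.log( '\x41' === '\u0041' ); // logs: true
console.log( '\x41' === 'A' ); // logs: true

// The same holds for © (code point A9), which lies outside ASCII.
console.log( '\xA9' === '\u00A9' ); // logs: true
console.log( '\xA9'.charCodeAt(0).toString(16) ); // logs: a9
```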

⦿ — — — — ⦿

Unicode Charset

Unicode characters can take either one or two code units of UTF-16. Hence, how we write a Unicode Escape depends on the code point of the character.

For character आ with code point 906₁₆, since it can be encoded in just one code unit, its Unicode Escape looks like below.

console.log( '\u0906' ); // logs: आ

The character 😊 with code point 1F60A₁₆ needs two code units of UTF-16. D83D₁₆ and DE0A₁₆ are the values of each code unit (16-bit number). Hence the Unicode Escape Sequence of this character looks like below.

console.log( '\uD83D\uDE0A' ); // logs: 😊

You can find the UTF-16 code units of a character using this online tool. However, writing UTF-16 escapes by hand is not always easy, as we need to know in advance how many code units a character takes.
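If you would rather not leave the console, the code units can also be read from JavaScript itself. Here is a minimal sketch using the built-in String.prototype.charCodeAt(), which returns the code unit at a given index. The helper name toCodeUnits is made up for this example, and padStart() assumes an ES2017 or newer engine.

```javascript
// Hypothetical helper (not a built-in): lists the UTF-16 code units of a
// string, each formatted as a 4-digit uppercase hexadecimal number.
function toCodeUnits(str) {
  var units = [];

  for (var i = 0; i < str.length; i++) {
    var hex = str.charCodeAt(i).toString(16).toUpperCase();
    units.push(hex.padStart(4, '0')); // pad to 4 digits, e.g. 41 becomes 0041
  }

  return units;
}

console.log( toCodeUnits('A').join(' ') ); // logs: 0041

console.log( toCodeUnits('आ').join(' ') ); // logs: 0906

console.log( toCodeUnits('😊').join(' ') ); // logs: D83D DE0A
```

Each value in the result can be pasted after a \u prefix to build the escape sequence.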

Also, the value after \u must be written as exactly 4 hexadecimal digits to form a valid Unicode Escape. Otherwise, JavaScript won’t be able to decode the escape sequence, as shown in the below example.

console.log('\u41');

// Uncaught SyntaxError: Invalid Unicode escape sequence

console.log('\u0041'); // logs: A

ES6 provides a new way to represent a character with Unicode Escapes using the code point of the character. Using \u{} notation, a character can be represented using its hexadecimal code point.

console.log( '\u{41}' ); // logs: A

console.log( '\u{906}' ); // logs: आ

console.log( '\u{1F60A}' ); // logs: 😊
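The \u{} notation and the surrogate pair notation are two spellings of the same string. The short check below shows this, along with the built-in String.fromCodePoint() (also ES6), which builds a character from a numeric code point at runtime.

```javascript
// Both notations produce the same two code units.
console.log( '\u{1F60A}' === '\uD83D\uDE0A' ); // logs: true

// Leading zeros inside the braces are allowed, so \u{41} and \u{0041} match.
console.log( '\u{41}' === '\u{0041}' ); // logs: true

// String.fromCodePoint() does the same job for a code point held in a variable.
var codePoint = 0x1F60A;
console.log( String.fromCodePoint(codePoint) === '\u{1F60A}' ); // logs: true
```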

⦿ — — — — ⦿

Mixed Characters

A Unicode Escape is nothing but a fancy way to represent a character using the encoding information of the character. Hence, it is valid to mix regular characters in their plain form with Unicode Escape sequences.

console.log( '\x41\u0020\uD83D\uDE0A\u0020\u006d\u0061\u006e' ); // logs: A 😊 man

console.log( 'A \uD83D\uDE0A man' ); // logs: A 😊 man

console.log( 'A \u{1F60A} man' ); // logs: A 😊 man

💡 You can check this ASCII table to find out the code points of the characters from the ASCII character set. Or you can use this Unicode character set.

⦿ — — — — ⦿

Length of a string

In JavaScript, we use the .length property of a string, which returns the number of characters in the string.

console.log( 'Ab'.length ); // logs: 2

However, we have been deceived by this definition. In practice, String.prototype.length returns the total number of UTF-16 code units used to encode the string. This is demonstrated in the below example.

console.log( '😊'.length ); // logs: 2

Since the character 😊 needs two UTF-16 code units, we get 2 as the length of the string. There is no built-in method in JavaScript that returns the Unicode-aware length of a string.

But with the ES6 spread operator, an easy fix is to spread the string into an array of characters and get the array's length, as shown in the below example.