The proposal “RegExp Unicode Property Escapes” by Mathias Bynens is at stage 4. This blog post explains how it works.

JavaScript lets you match characters by mentioning the “names” of sets of characters. For example, \s stands for “whitespace”:

> /^\s+$/u.test('\t \n\r')
true

The proposal lets you additionally match characters by mentioning their Unicode character properties (what those are is explained next) inside the curly braces of \p{}. Two examples:

> /^\p{White_Space}+$/u.test('\t \n\r')
true
> /^\p{Script=Greek}+$/u.test('μετά')
true

As you can see, one of the benefits of property escapes is that they make regular expressions more self-descriptive. Additional benefits will become clear later.

Before we delve into how property escapes work, let’s examine what Unicode character properties are.

Unicode character properties

In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:

The semantics of a character are determined by its identity, normative properties, and behavior.

Examples of properties

These are a few examples of properties:

Name: a unique name, composed of uppercase letters, digits, hyphens and spaces. For example:
    A: Name = LATIN CAPITAL LETTER A
    😀: Name = GRINNING FACE

General_Category: categorizes characters. For example:
    x: General_Category = Lowercase_Letter
    $: General_Category = Currency_Symbol

White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines. For example:
    \t: White_Space = True
    π: White_Space = False

Age: version of the Unicode Standard in which a character was introduced. For example, the Euro sign € was added in version 2.1 of the Unicode Standard:
    €: Age = 2.1

Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
    S: Block = Basic_Latin (range U+0000..U+007F)
    Д: Block = Cyrillic (range U+0400..U+04FF)

Script: a collection of characters used by one or more writing systems. Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc. Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century. Examples:
    α: Script = Greek
    א: Script = Hebrew

Types of properties

The following types of properties exist:

Enumerated property: a property whose values are few and named. General_Category is an enumerated property.

Closed enumerated property: an enumerated property whose set of values is fixed and will not be changed in future versions of the Unicode Standard.

Boolean property: a closed enumerated property whose values are True and False. Boolean properties are also called binary, because they are like markers that characters either have or not. White_Space is a binary property.

Numeric property: has values that are integers or real numbers.

String-valued property: a property whose values are strings.

Catalog property: an enumerated property that may be extended as the Unicode Standard evolves. Age and Script are catalog properties.

Miscellaneous property: a property whose values are not Boolean, enumerated, numeric, string or catalog values. Name is a miscellaneous property.

Matching properties and property values

Properties and property values are matched as follows:

Loose matching: case, whitespace, underscores and hyphens are ignored when comparing properties and property values. For example, "General_Category", "general category", "-general-category-", "GeneralCategory" are all considered to be the same property.

Aliases: the data files PropertyAliases.txt and PropertyValueAliases.txt define alternative ways of referring to properties and property values. Most aliases have long forms and short forms. For example:
    Long form: General_Category
    Short form: gc
Examples of property value aliases (per line, all values are considered equal):
    Lowercase_Letter, Ll
    Currency_Symbol, Sc
    True, T, Yes, Y
    False, F, No, N
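The loose-matching rule can be illustrated with a small normalization helper. This is a minimal sketch, not the full matching algorithm of the Unicode Standard: looseKey is a hypothetical name, and the helper only covers the handling of case, whitespace, underscores and hyphens described above.

```javascript
// Hypothetical helper (not part of any standard API): reduces a property
// name or value to a key that ignores case, whitespace, underscores and hyphens.
function looseKey(name) {
  return name.toLowerCase().replace(/[\s_-]+/g, '');
}

const spellings = [
  'General_Category',
  'general category',
  '-general-category-',
  'GeneralCategory',
];

// All four spellings collapse to the same key.
const keys = new Set(spellings.map(looseKey));
console.log(keys.size); // 1
console.log([...keys][0]); // 'generalcategory'
```

Note that this only illustrates how the Unicode Standard compares names; it is not something that regular expressions do for you.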

Unicode property escapes for regular expressions

Unicode property escapes look like this:

1. Match all characters whose property prop has the value value: \p{prop=value}
2. Match all characters that do not have a property prop whose value is value: \P{prop=value}
3. Match all characters whose binary property bin_prop is True: \p{bin_prop}
4. Match all characters whose binary property bin_prop is False: \P{bin_prop}

Forms (3) and (4) can also be used as an abbreviation for General_Category. For example: \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}
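The four forms and the General_Category abbreviation can be tried out as follows. This is a small sketch that assumes an engine with property escapes enabled; the variable names are made up for illustration.

```javascript
// Form 1: property=value
const isLower = /^\p{General_Category=Lowercase_Letter}$/u;
// Form 2: negated property=value
const isNotLower = /^\P{General_Category=Lowercase_Letter}$/u;
// Form 3: binary property
const isSpace = /^\p{White_Space}$/u;
// Form 4: negated binary property
const isNotSpace = /^\P{White_Space}$/u;
// Abbreviated form: a General_Category value on its own
const isLowerShort = /^\p{Lowercase_Letter}$/u;

console.log(isLower.test('x')); // true
console.log(isNotLower.test('x')); // false
console.log(isNotLower.test('$')); // true
console.log(isSpace.test('\t')); // true
console.log(isNotSpace.test('π')); // true
console.log(isLowerShort.test('x')); // true
```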

Important: In order to use property escapes, regular expressions must have the flag /u. Without /u, \p is the same as p.
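The effect of the flag can be checked directly. The following sketch assumes an engine that supports property escapes and, for the flagless case, the lenient web-compatible regular expression syntax found in browsers and Node.js, where the braces are treated as literal characters.

```javascript
const withoutFlag = /^\p{Letter}$/; // \p is just p, the braces are literal
const withFlag = /^\p{Letter}$/u; // \p{Letter} is a property escape

console.log(withoutFlag.test('a')); // false
console.log(withoutFlag.test('p{Letter}')); // true
console.log(withFlag.test('a')); // true
console.log(withFlag.test('p{Letter}')); // false
```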

Things to note:

Property escapes do not support loose matching. You must use aliases exactly as they are mentioned in PropertyAliases.txt and PropertyValueAliases.txt.

Implementations must support at least the following Unicode properties and their aliases:
    General_Category
    Script
    Script_Extensions
    The binary properties listed in the specification (and no others, to guarantee interoperability). These include, among others: Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, Assigned, ID_Start, ID_Continue, Join_Control, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base.
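The lack of loose matching can be observed by compiling a few spellings of the same property. This is a sketch under the assumption that the engine supports property escapes; compilesWithUnicodeFlag is a hypothetical helper name, and it relies on the RegExp constructor throwing a SyntaxError for names it does not recognize.

```javascript
// Hypothetical helper: reports whether `source` compiles as a /u pattern.
function compilesWithUnicodeFlag(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

// Exact long and short aliases are accepted.
console.log(compilesWithUnicodeFlag('\\p{General_Category=Lowercase_Letter}')); // true
console.log(compilesWithUnicodeFlag('\\p{gc=Ll}')); // true
console.log(compilesWithUnicodeFlag('\\p{Script_Extensions=Greek}')); // true

// Loosely matched spellings are rejected.
console.log(compilesWithUnicodeFlag('\\p{general_category=lowercase_letter}')); // false
console.log(compilesWithUnicodeFlag('\\p{GeneralCategory=LowercaseLetter}')); // false
```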



Matching whitespace:

> /^\p{White_Space}+$/u.test('\t \n\r')
true

Matching letters:

> /^\p{Letter}+$/u.test('πüé')
true

Matching Greek letters:

> /^\p{Script=Greek}+$/u.test('μετά')
true

Matching Latin letters:

> /^\p{Script=Latin}+$/u.test('Grüße')
true
> /^\p{Script=Latin}+$/u.test('façon')
true
> /^\p{Script=Latin}+$/u.test('mañana')
true
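Property escapes are also handy for extracting text, not just for testing it. The following sketch combines them with the /g flag to pull words out of a mixed-script string; the sample string and variable names are made up for illustration.

```javascript
const text = 'Grüße, μετά, façon!';

// Every run of letters, regardless of script
const words = text.match(/\p{Letter}+/gu);
console.log(words); // [ 'Grüße', 'μετά', 'façon' ]

// Only runs of characters from the Greek script
const greekWords = text.match(/\p{Script=Greek}+/gu);
console.log(greekWords); // [ 'μετά' ]
```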

Matching lone surrogate characters:

> /^\p{Surrogate}+$/u.test('\u{D83D}')
true
> /^\p{Surrogate}+$/u.test('\u{DE00}')
true

Note that Unicode code points in astral planes (such as emojis) are composed of two JavaScript characters (a leading surrogate and a trailing surrogate). Therefore, you’d expect the previous regular expression to match the emoji 😀, which is all surrogates:

> '😀'.length
2
> '😀'.charCodeAt(0).toString(16)
'd83d'
> '😀'.charCodeAt(1).toString(16)
'de00'

However, with the /u flag, property escapes match code points, not JavaScript characters:

> /^\p{Surrogate}+$/u.test('😀')
false

In other words, 😀 is considered to be a single character:

> /^.$/u.test('😀')
true
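The difference between matching JavaScript characters and matching code points shows up when the same character class is used with and without /u. A small sketch, with a pattern chosen only for illustration (the range \uD800-\uDFFF covers all surrogate code units):

```javascript
const emoji = '😀'; // code point U+1F600, stored as the code units d83d and de00

// Without /u, the pattern is matched against JavaScript characters (code units).
console.log(/^[\uD800-\uDFFF]+$/.test(emoji)); // true

// With /u, the surrogate pair is treated as one code point outside that range.
console.log(/^[\uD800-\uDFFF]+$/u.test(emoji)); // false

// String methods that work with code points tell the same story.
console.log(emoji.codePointAt(0).toString(16)); // '1f600'
console.log([...emoji].length); // 1
```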

Trying it out

V8 5.8+ implements this proposal; it is switched on via the flag --harmony_regexp_property:

Node.js: node --harmony_regexp_property
    Check Node’s version of V8 via npm version

Chrome:
    Go to chrome://version/
    Check the version of V8.
    Find the “Executable Path”. For example: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
    Start Chrome: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --js-flags="--harmony_regexp_property"
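Whether the current engine supports property escapes can also be detected at runtime. This is a minimal sketch: supportsPropertyEscapes is a hypothetical helper name, and it relies on the RegExp constructor throwing a SyntaxError for patterns that the engine cannot parse.

```javascript
// Hypothetical helper: feature-detects Unicode property escapes.
function supportsPropertyEscapes() {
  try {
    // With /u, an unsupported \p{...} escape is a syntax error.
    return new RegExp('^\\p{Letter}$', 'u').test('a');
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(supportsPropertyEscapes());
```

Building the pattern from a string keeps the surrounding file parseable in engines where a regular expression literal with \p{...} and /u would itself be a syntax error.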


