ES2018: RegExp Unicode property escapes

[2017-07-19] dev, javascript, esnext, es2018, regexp

The proposal “RegExp Unicode Property Escapes” by Mathias Bynens is at stage 4. This blog post explains how it works.

Overview  

JavaScript lets you match characters by mentioning the “names” of sets of characters. For example, \s stands for “whitespace”:

> /^\s+$/u.test('\t \n\r')
true

The proposal lets you additionally match characters by mentioning their Unicode character properties (what those are is explained next) inside the curly braces of \p{}. Two examples:

> /^\p{White_Space}+$/u.test('\t \n\r')
true
> /^\p{Script=Greek}+$/u.test('μετά')
true

As you can see, one of the benefits of property escapes is that they make regular expressions more self-descriptive. Additional benefits will become clear later.

Before we delve into how property escapes work, let’s examine what Unicode character properties are.

Unicode character properties  

In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:

The semantics of a character are determined by its identity, normative properties, and behavior.

Examples of properties  

These are a few examples of properties:

  • Name: a unique name, composed of uppercase letters, digits, hyphens and spaces. For example:
    • A: Name = LATIN CAPITAL LETTER A
    • 😀: Name = GRINNING FACE
  • General_Category: categorizes characters. For example:
    • x: General_Category = Lowercase_Letter
    • $: General_Category = Currency_Symbol
  • White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines. For example:
    • \t: White_Space = True
    • π: White_Space = False
  • Age: version of the Unicode Standard in which a character was introduced. For example: The Euro sign € was added in version 2.1 of the Unicode standard.
    • €: Age = 2.1
  • Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
    • S: Block = Basic_Latin (range U+0000..U+007F)
    • Д: Block = Cyrillic (range U+0400..U+04FF)
  • Script: a collection of characters used by one or more writing systems.
    • Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc.
    • Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century.
    • Examples:
      • α: Script = Greek
      • א: Script = Hebrew

Types of properties  

The following types of properties exist:

  • Enumerated property: a property whose values are few and named. General_Category is an enumerated property.
  • Closed enumerated property: an enumerated property whose set of values is fixed and will not be changed in future versions of the Unicode Standard.
  • Boolean property: a closed enumerated property whose values are True and False. Boolean properties are also called binary, because they are like markers that characters either have or not. White_Space is a binary property.
  • Numeric property: has values that are integers or real numbers.
  • String-valued property: a property whose values are strings.
  • Catalog property: an enumerated property that may be extended as the Unicode Standard evolves. Age and Script are catalog properties.
  • Miscellaneous property: a property whose values are not Boolean, enumerated, numeric, string or catalog values. Name is a miscellaneous property.

Matching properties and property values  

Properties and property values are matched as follows:

  • Loose matching: case, whitespace, underscores and hyphens are ignored when comparing properties and property values. For example, "General_Category", "general category", "-general-category-", "GeneralCategory" are all considered to be the same property.
  • Aliases: the data files PropertyAliases.txt and PropertyValueAliases.txt define alternative ways of referring to properties and property values.
    • Most aliases have long forms and short forms. For example:
      • Long form: General_Category
      • Short form: gc
    • Examples of property value aliases (per line, all values are considered equal):
      • Lowercase_Letter, Ll
      • Currency_Symbol, Sc
      • True, T, Yes, Y
      • False, F, No, N
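Inside a property escape (the \p{} syntax shown in the overview), long and short aliases are interchangeable. The following is a minimal sketch, assuming an engine that supports property escapes; the variable names are only illustrative:

```javascript
// Long and short aliases refer to the same property and the same value.
const longForm = /^\p{General_Category=Lowercase_Letter}+$/u;
const shortForm = /^\p{gc=Ll}+$/u;

console.log(longForm.test('abc'));  // true
console.log(shortForm.test('abc')); // true
console.log(longForm.test('ABC'));  // false
console.log(shortForm.test('ABC')); // false
```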

Unicode property escapes for regular expressions  

Unicode property escapes look like this:

  1. Match all characters whose property prop has the value value:
    \p{prop=value}
    
  2. Match all characters that do not have a property prop whose value is value:
    \P{prop=value}
    
  3. Match all characters whose binary property bin_prop is True:
    \p{bin_prop}
    
  4. Match all characters whose binary property bin_prop is False:
    \P{bin_prop}
    

Forms (3) and (4) can also be used as an abbreviation for General_Category. For example: \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}.

Important: In order to use property escapes, regular expressions must have the flag /u. Without /u, \p is the same as p.
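The four forms, the General_Category abbreviation and the effect of omitting /u can be tried side by side. This is a small sketch, assuming an engine with property escape support; the constant names are made up for illustration:

```javascript
// Form 1: \p{prop=value} matches characters that have the property value.
const isGreek = /^\p{Script=Greek}$/u;
// Form 2: \P{prop=value} matches characters that do not have it.
const isNotGreek = /^\P{Script=Greek}$/u;
// Form 3: \p{bin_prop} matches characters whose binary property is True.
const isSpace = /^\p{White_Space}$/u;
// Form 4: \P{bin_prop} matches characters whose binary property is False.
const isNotSpace = /^\P{White_Space}$/u;
// Abbreviation for \p{General_Category=Lowercase_Letter}.
const isLower = /^\p{Lowercase_Letter}$/u;
// Without /u, \p is just p, so the pattern matches the literal text.
const noFlag = /^\p{Lowercase_Letter}$/;

console.log(isGreek.test('α'), isNotGreek.test('α'));   // true false
console.log(isSpace.test('\t'), isNotSpace.test('\t')); // true false
console.log(isLower.test('x'));                         // true
console.log(noFlag.test('x'));                          // false
console.log(noFlag.test('p{Lowercase_Letter}'));        // true
```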

Details  

Things to note:

  • Property escapes do not support loose matching. You must use aliases exactly as they are mentioned in PropertyAliases.txt and PropertyValueAliases.txt.
  • Implementations must support at least the following Unicode properties and their aliases:
    • General_Category
    • Script
    • Script_Extensions
    • The binary properties listed in the specification (and no others, to guarantee interoperability). These include, among others: Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, Assigned, ID_Start, ID_Continue, Join_Control, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base.
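One way to see the lack of loose matching is to compile patterns at runtime and check whether that throws. The helper below is hypothetical (it is not part of the proposal) and is only a sketch, assuming an engine that supports property escapes:

```javascript
// Hypothetical helper: returns true if `source` compiles with the /u flag.
function isValidUnicodePattern(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    // Unknown property names or values are reported as a SyntaxError.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Exact names and aliases are accepted:
console.log(isValidUnicodePattern('\\p{General_Category=Lowercase_Letter}')); // true
console.log(isValidUnicodePattern('\\p{gc=Ll}'));                             // true
console.log(isValidUnicodePattern('\\p{Script_Extensions=Greek}'));           // true

// Loosely matched spellings are rejected:
console.log(isValidUnicodePattern('\\p{general_category=lowercase_letter}')); // false
console.log(isValidUnicodePattern('\\p{GeneralCategory=LowercaseLetter}'));   // false
```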

Examples  

Matching whitespace:

> /^\p{White_Space}+$/u.test('\t \n\r')
true

Matching letters:

> /^\p{Letter}+$/u.test('πüé')
true
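The same escape also works with the /g flag to pull words out of mixed-script text. A minimal sketch, with a sample string chosen only for illustration:

```javascript
// \p{Letter}+ matches runs of letters from any script;
// digits, punctuation and spaces act as separators.
const sample = 'Straße 42, αβγ!';
const words = sample.match(/\p{Letter}+/gu);

console.log(words); // [ 'Straße', 'αβγ' ]
```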

Matching Greek letters:

> /^\p{Script=Greek}+$/u.test('μετά')
true

Matching Latin letters:

> /^\p{Script=Latin}+$/u.test('Grüße')
true
> /^\p{Script=Latin}+$/u.test('façon')
true
> /^\p{Script=Latin}+$/u.test('mañana')
true

Matching lone surrogate characters:

> /^\p{Surrogate}+$/u.test('\u{D83D}')
true
> /^\p{Surrogate}+$/u.test('\u{DE00}')
true

Note that Unicode code points in astral planes (such as emojis) are composed of two JavaScript characters (a leading surrogate and a trailing surrogate). Therefore, you’d expect the previous regular expression to match the emoji 😀, which is all surrogates:

> '😀'.length
2
> '😀'.charCodeAt(0).toString(16)
'd83d'
> '😀'.charCodeAt(1).toString(16)
'de00'

However, with the /u flag, property escapes match code points, not JavaScript characters:

> /^\p{Surrogate}+$/u.test('😀')
false

In other words, 😀 is considered to be a single character:

> /^.$/u.test('😀')
true
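The difference between code points and JavaScript characters also shows up when splitting a string with a regular expression. The following sketch (variable names are illustrative) compares the two modes and then matches the emoji as one unit via the binary property Emoji_Presentation:

```javascript
const text = 'a😀b';

// With /u, the dot consumes whole code points.
const codePoints = text.match(/./gu);
// Without /u, the dot consumes single JavaScript characters (code units),
// so the emoji is split into its two surrogates.
const codeUnits = text.match(/./g);

console.log(codePoints.length); // 3
console.log(codeUnits.length);  // 4

// A property escape therefore sees the emoji as a single character:
console.log(/^\p{Emoji_Presentation}$/u.test('😀')); // true
```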

Trying it out  

V8 5.8+ implements this proposal; it is switched on via --harmony_regexp_property:

  • Node.js: node --harmony_regexp_property
    • Check Node’s version of V8 via npm version
  • Chrome:
    • Go to chrome://version/
    • Check the version of V8.
    • Find the “Executable Path”. For example: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
    • Start Chrome: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --js-flags="--harmony_regexp_property"
