Why does "👩🏾‍🌾" have a length of 7 in JavaScript?


I turned this blog post into a talk which you might prefer.

In short: 👩🏾‍🌾 is made of 1 grapheme cluster, 4 scalars, and 7 UTF-16 code units. That’s why its length is 7.

The length property is used to determine the length of a JavaScript string. Sometimes, its results are intuitive:

"E".length;
// => 1

"♬".length;
// => 1

…sometimes, its results are surprising:

"🌸".length;
// => 2

"πŸ‘©πŸΎβ€πŸŒΎ".length;
// => 7

To understand why this happens, you need to understand a few terms from the Unicode glossary.

The first term is the extended grapheme cluster. This is probably what most people would call a character. E, ♬, 🌸, and 👩🏾‍🌾 are examples of extended grapheme clusters.

Extended grapheme clusters are made up of scalars. Scalars are integers between 0 and 1114111, though many of these numbers are currently unused.

Many extended grapheme clusters contain just one scalar. For example, 🌸 is made up of the scalar 127800 and E is made up of the scalar 69. 👩🏾‍🌾, however, is made up of four scalars: 128105, 127998, 8205, and 127806.

(Scalars are usually written in hex with a “U+” prefix. For example, the scalar for ♬ is 9836, which might be written as “U+266C”.)
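As a quick illustration of that notation, here is a small sketch that formats a string's first scalar in the “U+” style. The helper name `toUPlus` is made up for this example; `codePointAt`, `toString(16)`, and `padStart` are standard JavaScript.

```javascript
// Hypothetical helper: format the first scalar of a string as "U+XXXX".
function toUPlus(str) {
  const scalar = str.codePointAt(0);
  // Convention: uppercase hex, padded to at least four digits.
  return "U+" + scalar.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUPlus("E")); // => "U+0045"
console.log(toUPlus("\u266C")); // ♬ => "U+266C"
console.log(toUPlus("\u{1F338}")); // 🌸 => "U+1F338"
```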

Internally, JavaScript stores these scalars as UTF-16 code units. Each code unit is a 16-bit unsigned integer, which can store numbers between 0 and 65535. Many scalars fit into a single code unit. Scalars that are too big get split apart into two 16-bit numbers. Each such pair is called a surrogate pair, which is a term you might see.

For example, ♬ is made up of the scalar 9836. That fits into a single 16-bit integer, so we just store 9836.

The scalar for 🌸 is 127800. That’s too big for a 16-bit integer, so we have to break it up. It gets split up into 55356 and 57144. (I won’t discuss how this splitting works, but it’s not too complicated: the bits are divided in the middle and a different number is added to each half.)
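For the curious, the splitting can be sketched in a few lines. This follows the standard UTF-16 rule: subtract 65536 (0x10000), divide the remaining 20 bits in the middle, then add 0xD800 to the top half and 0xDC00 to the bottom half. The function name `toSurrogatePair` is hypothetical; JavaScript does this for you internally.

```javascript
// Sketch: split a scalar above 65535 into its two UTF-16 code units.
// `toSurrogatePair` is a made-up name for illustration, not a built-in.
function toSurrogatePair(scalar) {
  if (scalar <= 0xffff || scalar > 0x10ffff) {
    throw new RangeError("expected a scalar between 65536 and 1114111");
  }
  const offset = scalar - 0x10000; // leaves a 20-bit number
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

console.log(toSurrogatePair(127800)); // => [ 55356, 57144 ]
```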

That’s why "🌸".length === 2: JavaScript is counting the number of UTF-16 code units, which is 2 in this case.

👩🏾‍🌾 is made up of four scalars. One of those scalars fits in a single UTF-16 code unit, but the remaining three are too big and get split up. That makes for a total of 7 code units. That’s why "👩🏾‍🌾".length === 7.
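The arithmetic can be checked with a short sketch. It assumes only the rule described above: a scalar needs two code units when it is larger than 65535, and one otherwise. The helper `codeUnitsNeeded` is a hypothetical name.

```javascript
// Hypothetical helper: how many UTF-16 code units does one scalar need?
function codeUnitsNeeded(scalar) {
  return scalar > 0xffff ? 2 : 1;
}

// The four scalars that make up the farmer emoji.
const farmerScalars = [128105, 127998, 8205, 127806];
const perScalar = farmerScalars.map(codeUnitsNeeded);
const total = perScalar.reduce((sum, n) => sum + n, 0);

console.log(perScalar); // => [ 2, 2, 1, 2 ]
console.log(total); // => 7
```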

To summarize our examples:

| Extended grapheme cluster | Scalar(s) | UTF-16 code units |
| --- | --- | --- |
| E | 69 | 69 |
| ♬ | 9836 | 9836 |
| 🌸 | 127800 | 55356, 57144 |
| 👩🏾‍🌾 | 128105, 127998, 8205, 127806 | 55357, 56425, 55356, 57342, 8205, 55356, 57150 |
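If you want to reproduce the last row yourself, something like the following should work. It builds the string from its scalars with `String.fromCodePoint`, then reads the scalars back with `codePointAt` and the code units with `charCodeAt`. The variable names are just for illustration.

```javascript
// Build the farmer emoji from the four scalars listed above.
const farmer = String.fromCodePoint(128105, 127998, 8205, 127806);

// Scalars: Array.from walks the string one scalar at a time.
const scalars = Array.from(farmer, (chunk) => chunk.codePointAt(0));

// UTF-16 code units: index the string directly, which is what `length` counts.
const codeUnits = [];
for (let i = 0; i < farmer.length; i++) {
  codeUnits.push(farmer.charCodeAt(i));
}

console.log(scalars); // => [ 128105, 127998, 8205, 127806 ]
console.log(codeUnits); // => [ 55357, 56425, 55356, 57342, 8205, 55356, 57150 ]
console.log(farmer.length); // => 7
```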

Most JavaScript string operations also work with UTF-16.

slice(), for example, works with UTF-16 code units too. That’s why you might get strange results if you slice in the middle of a surrogate pair:

"The best character is X".slice(-1);
// => "X"

"The best character is 🌸".slice(-1);
// => "\udf38"
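One way to avoid cutting a surrogate pair in half is to split the string into scalars first and take the last one. The sketch below uses `Array.from`, which walks the string scalar by scalar; `lastScalar` is a hypothetical helper name. Note that this still would not keep a multi-scalar grapheme cluster such as 👩🏾‍🌾 together.

```javascript
// Hypothetical helper: return the last scalar of a string, as a string.
function lastScalar(str) {
  const chunks = Array.from(str); // one entry per scalar
  return chunks.length > 0 ? chunks[chunks.length - 1] : "";
}

const sentence = "The best character is \u{1F338}"; // ends with 🌸
console.log(sentence.slice(-1) === "\udf38"); // => true (half a surrogate pair)
console.log(lastScalar(sentence) === "\u{1F338}"); // => true (the whole 🌸)
```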

However, not all JavaScript string operations use UTF-16 code units. For example, iterating over a string works a little differently:

// The spread operator uses an iterator:
[..."👩🏾‍🌾"];
// => ["👩", "🏾", "\u200d", "🌾"]

// Same for `for ... of`:
for (const c of "👩🏾‍🌾") {
  console.log(c);
}
// => "👩"
// => "🏾"
// => "\u200d" (the invisible zero-width joiner)
// => "🌾"

As you can see, this iterates over scalars, not UTF-16 code units.

Intl.Segmenter, which isn’t supported in every browser, can help you iterate over extended grapheme clusters if that’s what you need:

const str = "farmer: 👩🏾‍🌾";

// Warning: this is not supported on all browsers!
const segments = new Intl.Segmenter().segment(str);
[...segments];
// => [
//      { segment: "f", index: 0, input: "farmer: 👩🏾‍🌾" },
//      { segment: "a", index: 1, input: "farmer: 👩🏾‍🌾" },
//      { segment: "r", index: 2, input: "farmer: 👩🏾‍🌾" },
//      { segment: "m", index: 3, input: "farmer: 👩🏾‍🌾" },
//      { segment: "e", index: 4, input: "farmer: 👩🏾‍🌾" },
//      { segment: "r", index: 5, input: "farmer: 👩🏾‍🌾" },
//      { segment: ":", index: 6, input: "farmer: 👩🏾‍🌾" },
//      { segment: " ", index: 7, input: "farmer: 👩🏾‍🌾" },
//      { segment: "👩🏾‍🌾", index: 8, input: "farmer: 👩🏾‍🌾" }
//    ]
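If all you need is a count of extended grapheme clusters, a small wrapper with a feature check might look like this. It is a sketch, not a polyfill: `countGraphemes` is a hypothetical name, and returning `undefined` when `Intl.Segmenter` is missing is just one possible design choice.

```javascript
// Hypothetical helper: count extended grapheme clusters when
// Intl.Segmenter is available, otherwise return undefined.
function countGraphemes(str) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return undefined; // the caller decides how to fall back
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

const farmerLabel = "farmer: " + String.fromCodePoint(128105, 127998, 8205, 127806);
console.log(farmerLabel.length); // => 15 (UTF-16 code units)
console.log(countGraphemes(farmerLabel)); // => 9 where Intl.Segmenter is available
```

Returning `undefined` instead of guessing keeps the caller aware that no grapheme-aware count was possible in that environment.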

For more on this tricky stuff, check out “It’s Not Wrong that "🤦🏼‍♂️".length == 7”, “The Absolute Minimum Every Software Developer Must Know About Unicode in 2023”, “JavaScript has a Unicode problem”, and a talk I gave on this topic.