Count Unicode Characters — Grapheme Clusters and Code Points

Count Unicode characters correctly using grapheme cluster segmentation. Learn the difference between code units, code points, and grapheme clusters, and why string.length over-counts emoji.

Advanced

Detailed Explanation

Unicode Character Counting

Counting "characters" in Unicode text is one of the most misunderstood problems in programming. The answer depends on what you mean by character — and there are at least three valid definitions.

Three Levels of "Character"

Consider the string containing the flag emoji for Japan: "🇯🇵"

Level                 Count   How to Get It
UTF-16 code units     4       str.length
Unicode code points   2       [...str].length
Grapheme clusters     1       Intl.Segmenter

The user sees 1 character (a flag), but JavaScript reports a length of 4.
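The three counts in the table can be reproduced with a short sketch. It assumes a runtime that provides Intl.Segmenter; the variable names are illustrative only, and the flag is written with escapes so the code points are visible.

```javascript
// The Japan flag is two regional indicator symbols: U+1F1EF + U+1F1F5.
const flag = "\u{1F1EF}\u{1F1F5}";

// UTF-16 code units: each regional indicator needs a surrogate pair.
const codeUnits = flag.length;

// Code points: the string iterator steps over whole code points.
const codePoints = [...flag].length;

// Grapheme clusters: assumes Intl.Segmenter is available in this runtime.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(flag)].length;

console.log(codeUnits, codePoints, graphemes); // 4 2 1
```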

Code Units vs. Code Points

JavaScript strings are sequences of UTF-16 code units (16-bit values). Characters outside the Basic Multilingual Plane (above U+FFFF) require two code units called a surrogate pair:

const emoji = "😀";
console.log(emoji.length);        // 2 (UTF-16 code units)
console.log([...emoji].length);   // 1 (code points)

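The spread operator and Array.from() both use the string iterator, which steps by code point rather than by code unit. That avoids splitting surrogate pairs, but it still counts each code point in a multi-code-point grapheme (such as a flag or a letter with a combining accent) separately.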
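As a rough illustration of both points, the sketch below prints the two code units behind "😀" and then shows code point iteration still returning 2 for a letter followed by a combining accent. The variable names are hypothetical.

```javascript
const emoji = "\u{1F600}"; // 😀, a code point above U+FFFF

// Each UTF-16 code unit of the surrogate pair, in hex.
const units = emoji.split("").map((u) => u.charCodeAt(0).toString(16));

// The single code point that the pair encodes, in hex.
const point = emoji.codePointAt(0).toString(16);

console.log(units); // [ 'd83d', 'de00' ]
console.log(point); // 1f600

// Code point iteration is not grapheme-aware:
// "e" + U+0301 (combining acute accent) renders as one character.
const accented = "e\u0301";
const accentedCodePoints = Array.from(accented).length;
console.log(accentedCodePoints); // 2
```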

Grapheme Clusters

A grapheme cluster is what a user perceives as a single character. Complex grapheme clusters include:

  • Flag emoji: 🇯🇵 = U+1F1EF + U+1F1F5 (2 code points, 1 grapheme)
  • Family emoji: 👨‍👩‍👧‍👦 = 4 emoji joined by 3 zero-width joiners (ZWJ, U+200D) = 7 code points, 1 grapheme
  • Accented characters: e + U+0301 (combining acute accent) = 2 code points, 1 grapheme
  • Skin tone emoji: 👋🏽 = U+1F44B + U+1F3FD (2 code points, 1 grapheme)
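The examples in the list above can be measured at all three levels. This is a minimal sketch that assumes Intl.Segmenter is available; samples, measure, and results are hypothetical names, and exact grapheme counts for emoji sequences depend on the Unicode data the runtime ships.

```javascript
// Assumes a runtime with Intl.Segmenter.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// The four examples from the list, written as escapes.
const samples = {
  flag: "\u{1F1EF}\u{1F1F5}",
  family: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}",
  accented: "e\u0301",
  skinTone: "\u{1F44B}\u{1F3FD}",
};

// Count one string at each of the three levels.
function measure(text) {
  return {
    codeUnits: text.length,
    codePoints: [...text].length,
    graphemes: [...segmenter.segment(text)].length,
  };
}

const results = Object.fromEntries(
  Object.entries(samples).map(([name, text]) => [name, measure(text)])
);

console.table(results);
```

In a runtime with current Unicode data, every row should report 1 grapheme while the code point and code unit counts differ from row to row.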

The Correct Way to Count

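// User-perceived characters (grapheme clusters), segmented per UAX #29.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// Unicode code points: the string iterator steps by code point.
function countCodePoints(text) {
  return [...text].length;
}

// UTF-16 code units: what String.prototype.length reports.
function countCodeUnits(text) {
  return text.length;
}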

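Intl.Segmenter is supported in current versions of Chrome, Edge, Safari, and Firefox (Firefox was the last to add it, in version 125) and in Node.js 16+. It applies the Unicode text segmentation rules (UAX #29) to identify grapheme cluster boundaries, so results for newer emoji sequences can vary with the Unicode version the runtime ships.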
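Where older runtimes still matter, one option is to feature-detect Intl.Segmenter and fall back to code point counting. The sketch below is one way to do that, not a recommended standard; countGraphemesSafe is a hypothetical helper name, and the fallback over-counts any grapheme made of several code points.

```javascript
// Hypothetical helper: grapheme count with a code point fallback.
function countGraphemesSafe(text) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    // undefined locale = use the runtime's default locale.
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    let count = 0;
    // Iterate instead of spreading to avoid building a large array.
    for (const _segment of segmenter.segment(text)) {
      count++;
    }
    return count;
  }
  // Fallback: code points. Flags, ZWJ sequences, and combining
  // marks are counted as more than one character here.
  return [...text].length;
}

console.log(countGraphemesSafe("abc"));
```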

Practical Implications

  • Form validation: If limiting user input to N characters, use grapheme cluster count to match user expectations
  • Database storage: When a column or payload limit is defined in bytes, measure the encoded byte length (for example UTF-8), not the grapheme or code point count
  • Display width: Neither code points nor graphemes tell you the visual width — use a library that considers East Asian width for terminal/monospace contexts
  • Substring operations: Slicing by code unit index can cut a surrogate pair in half, leaving a lone surrogate (an ill-formed string), and slicing by code point can still break a grapheme cluster apart
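Three of the points above can be sketched in a few lines: truncating by grapheme for an input limit, measuring UTF-8 bytes for a storage limit, and seeing what a code unit slice does to an emoji. The helpers truncateGraphemes and utf8ByteLength are hypothetical names, and the sketch assumes Intl.Segmenter and TextEncoder exist in the runtime.

```javascript
// Assumes a runtime with Intl.Segmenter and TextEncoder.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Keep at most `max` user-perceived characters, never cutting inside one.
function truncateGraphemes(text, max) {
  let out = "";
  let count = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    if (count === max) break;
    out += segment;
    count++;
  }
  return out;
}

// Byte length of the UTF-8 encoding, for byte-based storage limits.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

const wave = "\u{1F44B}\u{1F3FD}"; // 👋🏽 = waving hand + skin tone modifier
const text = "Hi " + wave + "!";

const truncated = truncateGraphemes(text, 4); // "Hi " plus the whole emoji
const waveBytes = utf8ByteLength(wave);       // 8 (two 4-byte code points)

// Slicing by code unit index cuts the first surrogate pair in half.
const broken = text.slice(0, 4);
const lastUnit = broken.charCodeAt(3).toString(16); // "d83d", a lone high surrogate

console.log(truncated, waveBytes, lastUnit);
```

Here a single grapheme occupies 8 bytes in UTF-8, which is why grapheme limits and byte limits have to be checked separately.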

Use Case

Developers building text input components need accurate character counting for emoji and multi-language support. Internationalization (i18n) engineers ensure character limits work correctly across all scripts. Database designers size fields by encoded byte length, budgeting up to 4 bytes per code point in UTF-8 and keeping in mind that a single grapheme can span many code points, and QA engineers test character limits with complex Unicode inputs.
