Count Unicode Characters — Grapheme Clusters and Code Points
Count Unicode characters correctly using grapheme cluster segmentation. Learn the difference between code units, code points, and grapheme clusters, and why string.length gives wrong results for emoji.
Detailed Explanation
Unicode Character Counting
Counting "characters" in Unicode text is one of the most misunderstood problems in programming. The answer depends on what you mean by character — and there are at least three valid definitions.
Three Levels of "Character"
Consider the string containing the flag emoji for Japan: "🇯🇵"
| Level | Count | How to Get It |
|---|---|---|
| UTF-16 code units | 4 | str.length |
| Unicode code points | 2 | [...str].length |
| Grapheme clusters | 1 | Intl.Segmenter |
The user sees 1 character (a flag), but JavaScript reports a length of 4.
Code Units vs. Code Points
JavaScript strings are sequences of UTF-16 code units (16-bit values). Characters outside the Basic Multilingual Plane (above U+FFFF) require two code units called a surrogate pair:
const emoji = "😀";
console.log(emoji.length); // 2 (UTF-16 code units)
console.log([...emoji].length); // 1 (code points)
Spread syntax and Array.from() both iterate by code point. That is more useful than counting code units, but it still overcounts any user-perceived character that is built from several code points.
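To make the two levels visible side by side, the sketch below lists the code units and the code points of the same string. The helper names `toCodeUnits` and `toCodePoints` are hypothetical, chosen only for this illustration.

```javascript
// Hypothetical helpers for illustration; the names are not from any library.

// List the UTF-16 code units of a string as 4-digit hex values.
function toCodeUnits(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

// List the code points of a string in U+XXXX notation.
// Array.from uses the string iterator, which yields a surrogate pair as one item.
function toCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const sample = "a😀";
console.log(toCodeUnits(sample));  // [ '0061', 'D83D', 'DE00' ]
console.log(toCodePoints(sample)); // [ 'U+0061', 'U+1F600' ]
```

The two code units D83D and DE00 are the surrogate pair that encodes U+1F600.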
Grapheme Clusters
A grapheme cluster is what a user perceives as a single character. Complex grapheme clusters include:
- Flag emoji: 🇯🇵 = U+1F1EF + U+1F1F5 (2 code points, 1 grapheme)
- Family emoji: 👨‍👩‍👧‍👦 = U+1F468 + U+1F469 + U+1F467 + U+1F466 joined by three zero-width joiners (ZWJ, U+200D) (7 code points, 1 grapheme)
- Accented characters: é = e (U+0065) + combining acute accent (U+0301) (2 code points, 1 grapheme)
- Skin tone emoji: 👋🏽 = U+1F44B + U+1F3FD (2 code points, 1 grapheme)
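The counts in the list above can be checked with a short script. This is a sketch that assumes a runtime with Intl.Segmenter; the helper name `describeCounts` is hypothetical, and the strings are built from code points so that invisible characters such as ZWJ cannot be lost in copy and paste.

```javascript
// Hypothetical helper: report all three counts for one string.
function describeCounts(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,
    codePoints: [...text].length,
    graphemes: [...segmenter.segment(text)].length,
  };
}

// Build each sample from explicit code points.
const samples = {
  flag: String.fromCodePoint(0x1f1ef, 0x1f1f5),
  family: String.fromCodePoint(
    0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f466
  ),
  accented: "e\u0301",
  skinTone: String.fromCodePoint(0x1f44b, 0x1f3fd),
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, describeCounts(text));
}
// flag     { codeUnits: 4,  codePoints: 2, graphemes: 1 }
// family   { codeUnits: 11, codePoints: 7, graphemes: 1 }
// accented { codeUnits: 2,  codePoints: 2, graphemes: 1 }
// skinTone { codeUnits: 4,  codePoints: 2, graphemes: 1 }
```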
The Correct Way to Count
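// Grapheme clusters: what the user perceives as characters (UAX #29 segmentation).
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}
// Code points: spreading a string iterates by code point, so a surrogate pair counts once.
function countCodePoints(text) {
  return [...text].length;
}
// UTF-16 code units: the value JavaScript reports as the string's length.
function countCodeUnits(text) {
  return text.length;
}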
Intl.Segmenter is supported in current versions of Chrome, Edge, Safari, and Firefox, and in Node.js 16+. It uses the Unicode text segmentation algorithm (UAX #29) to identify grapheme cluster boundaries.
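For code that may also run on an older runtime, one option is to feature-detect Intl.Segmenter and fall back to a code point count. The sketch below is one possible approach, not a standard pattern; the function name `countGraphemesWithFallback` and the `exact` flag are assumptions made for this example.

```javascript
// Hypothetical helper: count graphemes when possible, otherwise approximate.
// `exact` tells the caller whether the count came from grapheme segmentation.
function countGraphemesWithFallback(text) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(text)) {
      count++;
    }
    return { count, exact: true };
  }
  // Fallback: code points overcount flags, ZWJ sequences, and combining marks.
  return { count: [...text].length, exact: false };
}

const wave = String.fromCodePoint(0x1f44b, 0x1f3fd); // 👋🏽
const result = countGraphemesWithFallback(wave);
console.log(result.count, result.exact ? "(graphemes)" : "(code points, approximate)");
```

Returning the `exact` flag lets a caller decide whether an approximate count is acceptable, for example by relaxing a client-side limit and enforcing the real one on the server.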
Practical Implications
- Form validation: If limiting user input to N characters, use grapheme cluster count to match user expectations
- Database storage: When a column or protocol limit is defined in bytes, measure the encoded byte length (for example UTF-8), not the character count
- Display width: Neither code points nor graphemes tell you the visual width — use a library that considers East Asian width for terminal/monospace contexts
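- Substring operations: Slicing by code unit index can cut a surrogate pair in half and leave a lone surrogate, producing an ill-formed string; slicing by code point avoids that but can still split a grapheme cluster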
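Two of the points above, byte length for storage and safe slicing, can be sketched as follows. The helper names `utf8ByteLength` and `truncateGraphemes` are hypothetical; the sketch assumes a runtime that provides TextEncoder and Intl.Segmenter.

```javascript
// Hypothetical helpers for illustration; not part of any library.
const encoder = new TextEncoder(); // always encodes to UTF-8

// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// Keep at most `maxGraphemes` user-perceived characters, never cutting
// inside a surrogate pair or a multi-code-point cluster.
function truncateGraphemes(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxGraphemes) break;
    result += segment;
    count++;
  }
  return result;
}

// "hi", then a waving hand with a skin tone modifier, then "!"
const message = "hi" + String.fromCodePoint(0x1f44b, 0x1f3fd) + "!";

console.log(utf8ByteLength(message));               // 11 (2 + 4 + 4 + 1 bytes)
console.log(truncateGraphemes(message, 3).length);  // 6 code units: "hi" plus the full emoji
console.log(message.slice(0, 3).charCodeAt(2).toString(16)); // d83d, a lone high surrogate
```

The last line shows the failure mode: `slice(0, 3)` ends in the first half of a surrogate pair, while `truncateGraphemes` keeps the emoji and its modifier together.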
Use Case
Developers building text input components need accurate character counting for emoji and multi-language support. Internationalization (i18n) engineers ensure character limits work correctly across all scripts. Database designers choose field sizes based on encoded byte sizes rather than visible character counts, since a single grapheme cluster can span many code points, and QA engineers test character limits with complex Unicode inputs.