Combining Characters and Diacritical Marks
Understand how combining diacritical marks create visual characters from multiple code points, and why grapheme cluster count differs from code point count.
Detailed Explanation
Combining Characters: Multiple Code Points, One Visual Character
Unicode allows characters to be composed from a base character plus one or more combining marks. The result looks like a single character but consists of multiple code points.
Example: é Two Ways
Precomposed (NFC):
é → U+00E9 (1 code point, 2 UTF-8 bytes)
Decomposed (NFD):
é → U+0065 + U+0301 (2 code points, 3 UTF-8 bytes)
Both render identically as é, but the counts differ:
| Metric | Precomposed | Decomposed |
|---|---|---|
| .length | 1 | 2 |
| Code points | 1 | 2 |
| Grapheme clusters | 1 | 1 |
| UTF-8 bytes | 2 | 3 |
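The table's numbers can be reproduced with a short JavaScript sketch. The `measure` helper is a hypothetical name introduced here for illustration, and the code assumes a runtime that provides `Intl.Segmenter` and `TextEncoder` (recent browsers and Node.js versions do).

```javascript
// Count a string four ways: UTF-16 code units, code points,
// grapheme clusters, and UTF-8 bytes.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    length: text.length,                              // UTF-16 code units
    codePoints: [...text].length,                     // the string iterator walks code points
    graphemes: [...segmenter.segment(text)].length,   // user-perceived characters
    utf8Bytes: new TextEncoder().encode(text).length, // encoded size in UTF-8
  };
}

const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e + COMBINING ACUTE ACCENT

console.log(measure(precomposed)); // length 1, codePoints 1, graphemes 1, utf8Bytes 2
console.log(measure(decomposed));  // length 2, codePoints 2, graphemes 1, utf8Bytes 3
```

Writing the strings as escapes keeps the two forms distinct even if an editor or build step normalizes the source file.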
Stacked Combining Marks
You can stack multiple combining marks on a single base character:
à́̂ → a + grave + acute + circumflex
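This creates one grapheme cluster from 4 code points. JavaScript's .length counts UTF-16 code units, and each of these code points fits in one code unit, so .length returns 4 even though the result is visually one character.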
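As a small check of the stacked example, the following sketch builds the sequence from explicit escapes and compares the counts. The variable names are illustrative, and `Intl.Segmenter` support is assumed.

```javascript
// Build the stacked example from escapes so the code points are unambiguous.
const stacked = "a\u0300\u0301\u0302"; // a + grave + acute + circumflex

const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const clusters = [...graphemeSegmenter.segment(stacked)].map((s) => s.segment);

// Label each code point in U+XXXX notation.
const codePointLabels = [...stacked].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(stacked.length);            // 4
console.log(clusters.length);           // 1
console.log(codePointLabels.join(" ")); // U+0061 U+0300 U+0301 U+0302
```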
Zalgo Text
"Zalgo text" exploits combining marks by stacking dozens of them:
H̶̺̘e̸͈l̷̙l̶̽o̵͓
In this example each visible letter carries 2-3 combining marks, so the code point count is several times the grapheme count, which stays at one per visible letter. Heavier Zalgo text stacks dozens of marks on each letter. The String Length Calculator's grapheme breakdown reveals exactly which combining marks are attached to each base character.
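A per-grapheme breakdown of this kind can be approximated in a few lines. This is a minimal sketch, not the calculator's actual implementation: `describeGraphemes` is a hypothetical helper, and it treats any code point matching the Unicode property `\p{M}` (Mark) as a combining mark.

```javascript
// Hypothetical helper: split text into grapheme clusters and, for each one,
// separate the base code point(s) from the attached combining marks.
function describeGraphemes(text) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(text)].map(({ segment }) => {
    const points = [...segment]; // code points in this cluster
    return {
      grapheme: segment,
      base: points.filter((ch) => !/\p{M}/u.test(ch)).join(""),
      marks: points
        .filter((ch) => /\p{M}/u.test(ch))
        .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")),
    };
  });
}

// "H" with three marks and "i" with one, written as escapes for clarity.
const sample = "H\u0336\u033A\u0318i\u0301";
for (const { base, marks } of describeGraphemes(sample)) {
  console.log(base, marks.join(" "));
}
// H U+0336 U+033A U+0318
// i U+0301
```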
Practical Impact
- String truncation: Cutting a string at a fixed code point count may split a combining sequence, separating a base character from its marks so that an accent is dropped or left stranded. Always truncate at grapheme boundaries.
- Input validation: A "5 character" limit should count grapheme clusters, not code points, to avoid rejecting valid text like "é" in decomposed form.
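- String comparison: "café" (NFC) and "café" (NFD) look identical but are different code point sequences, and therefore different byte sequences, so a direct equality check fails. Normalize both strings to the same form before comparing.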
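The three points above can be sketched in JavaScript as follows. The function names are hypothetical, `Intl.Segmenter` support is assumed, and NFC is chosen arbitrarily as the common normalization form; any form works as long as both sides use the same one.

```javascript
const clusterSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Keep at most `limit` grapheme clusters, so a base character is never
// separated from its combining marks.
function truncateGraphemes(text, limit) {
  let result = "";
  let count = 0;
  for (const { segment } of clusterSegmenter.segment(text)) {
    if (count === limit) break;
    result += segment;
    count += 1;
  }
  return result;
}

// Count user-perceived characters, for enforcing a length limit.
function graphemeCount(text) {
  return [...clusterSegmenter.segment(text)].length;
}

// Compare after normalizing both sides to the same form (NFC here).
function equalNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const nfc = "caf\u00E9";  // precomposed é
const nfd = "cafe\u0301"; // e + combining acute

console.log(nfd.slice(0, 4) === "cafe");        // true: code-unit slicing drops the accent
console.log(truncateGraphemes(nfd, 4) === nfd); // true: the accent stays attached
console.log(graphemeCount(nfd));                // 4, although nfd.length is 5
console.log(nfc === nfd);                       // false
console.log(equalNormalized(nfc, nfd));         // true
```

Normalizing once at the input boundary, rather than at every comparison, is a common design choice because it keeps stored text in a single form.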
Use Case
When building text editors, input validators, or search functionality that handles international text, understanding combining characters prevents bugs like broken truncation, inconsistent search results, and incorrect character counting for user-facing limits.