How Zalgo Text Works at the Unicode Level
Explore the internal Unicode representation of Zalgo text, including code points, grapheme clusters, and how renderers handle excessive combining marks.
Detailed Explanation
Zalgo at the Unicode Level
To truly understand Zalgo text, you need to look at what happens at the code point level and how text rendering systems process the stacked combining marks.
Code Point Representation
A single "zalgo character" is actually multiple Unicode code points:
Base character: H (U+0048)
Combining above: ̀ (U+0300) ́ (U+0301) ̂ (U+0302)
Combining below: ̧ (U+0327) ̰ (U+0330)
Combining mid: ̶ (U+0336)
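In this decomposed form, "H" followed by 6 combining marks is stored as 7 code points but renders as a single, heavily modified character. Note that NFC normalization merges H and the cedilla into the precomposed Ḩ (U+1E28), which leaves 6 code points.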
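To make the breakdown above concrete, the following sketch lists the code points of the example cluster. The variable names are illustrative, and the string is written with escapes so that copy and paste cannot silently change its form.

```javascript
// Build the example from escapes so the code points are unambiguous.
const cluster = "H\u0300\u0301\u0302\u0327\u0330\u0336";

// Spreading a string iterates by code point, not by UTF-16 code unit.
const codePoints = [...cluster].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints.join(" "));
// U+0048 U+0300 U+0301 U+0302 U+0327 U+0330 U+0336
```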
String Length vs. Visual Length
This has important implications for string handling:
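const zalgo = "H\u0300\u0301\u0302\u0327\u0330\u0336"; // H + 6 combining marks, written as escapes
zalgo.length; // 7 (UTF-16 code units)
[...zalgo].length; // 7 (code points; same count because every one is in the BMP)
// But visually it appears as ONE character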
Grapheme Clusters
Unicode defines grapheme clusters as user-perceived characters. A base character plus all its combining marks forms a single extended grapheme cluster. The Intl.Segmenter API correctly identifies this:
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...segmenter.segment(zalgo)];
segments.length; // 1 (one grapheme cluster)
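Grapheme segmentation is what makes operations like counting or truncating safe for Zalgo text. The sketch below assumes a runtime that ships Intl.Segmenter (recent browsers, Node.js 16 or later); `graphemes` and `truncateGraphemes` are hypothetical helper names, not part of any library.

```javascript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: split a string into user-perceived characters.
function graphemes(text) {
  return Array.from(graphemeSegmenter.segment(text), (s) => s.segment);
}

// Hypothetical helper: keep at most `max` clusters without cutting one in half.
function truncateGraphemes(text, max) {
  return graphemes(text).slice(0, max).join("");
}

const sample = "H\u0300\u0301\u0302\u0327\u0330\u0336i";
console.log(graphemes(sample).length);            // 2 ("H" with its marks, then "i")
console.log(truncateGraphemes(sample, 1).length); // 7 (the whole first cluster is kept)
```

Truncating with `slice` or `substring` directly on the string could instead cut the cluster between two marks and leave a partially decorated character.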
Rendering Behavior
Text rendering engines handle excessive combining marks differently:
- Most browsers: Attempt to render all marks, so the stack overflows the line box and overlaps neighboring lines
- Terminal emulators: May truncate or ignore excess marks
- Mobile devices: May limit rendering to prevent performance issues
- PDF generators: Usually render all marks faithfully
Canonical Ordering
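Unicode assigns every character a Canonical Combining Class (CCC), which determines how sequences of combining marks are ordered during normalization. Adjacent marks with different non-zero CCC values are sorted into ascending CCC order, because they attach to different positions and their order does not affect the result. Marks with the same CCC value keep their relative order, since that order is significant (it decides which mark sits closer to the base). Characters with CCC 0, which include base characters, spacing marks, and some nonspacing marks, are never reordered and act as barriers that other marks cannot move across.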
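Canonical ordering can be observed with String.prototype.normalize. In the sketch below, `hex` is a hypothetical helper for printing code points; the CCC values in the comments come from the Unicode Character Database (overlay 1, cedilla 202, below 220, above 230).

```javascript
// Hypothetical helper: print a string as space-separated hex code points.
const hex = (s) =>
  [...s].map((c) => c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")).join(" ");

// Marks entered in the order: above (230) x3, cedilla (202), below (220), overlay (1).
const typed = "H\u0300\u0301\u0302\u0327\u0330\u0336";

// NFD sorts the marks by ascending CCC; the three "above" marks keep their order.
console.log(hex(typed.normalize("NFD")));
// 0048 0336 0327 0330 0300 0301 0302

// Different classes: both input orders normalize to the same sequence.
console.log("a\u0300\u0327".normalize("NFD") === "a\u0327\u0300".normalize("NFD")); // true

// Same class: order is significant, so the two strings stay distinct.
console.log("a\u0300\u0301".normalize("NFD") === "a\u0301\u0300".normalize("NFD")); // false
```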
Use Case
Understanding Unicode internals is essential for developers building text processing pipelines, input sanitization systems, or debugging rendering issues with internationalized text that contains unexpected combining characters.
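As one possible starting point for input sanitization, the sketch below caps the number of combining marks that may follow a base character. The function name `limitCombiningMarks` and the default limit of 2 are assumptions chosen for illustration; a real limit should be tuned to the languages you expect, since some scripts legitimately stack several marks.

```javascript
// Hypothetical helper: keep at most `maxMarks` combining marks in each run.
function limitCombiningMarks(text, maxMarks = 2) {
  // \p{M} matches any code point in the Unicode Mark categories (Mn, Mc, Me);
  // the `u` flag is required for Unicode property escapes.
  const runOfMarks = /\p{M}+/gu;
  return text.replace(runOfMarks, (run) => [...run].slice(0, maxMarks).join(""));
}

const noisy = "H\u0300\u0301\u0302\u0327\u0330\u0336ello";
const cleaned = limitCombiningMarks(noisy);

console.log([...noisy].length);   // 11 code points
console.log([...cleaned].length); // 7 code points: "H" + 2 marks + "ello"
```

Capping rather than stripping every mark is a deliberate choice here: removing all marks would also damage ordinary accented text that happens to arrive in decomposed form.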