String Length Surprises with Zalgo Text
Discover how Zalgo text affects string length calculations in JavaScript, Python, and other languages, and the difference between code points, code units, and grapheme clusters.
Detailed Explanation
String Length and Zalgo Text
Zalgo text creates a disconnect between visual length and programmatic string length. A single visible character can contain dozens of code points.
JavaScript String Length
const clean = "Hello";
const zalgo = "H\u0300\u0301\u0302e\u0303\u0304l\u0305\u0306l\u0307\u0308o\u0309\u030A";
clean.length; // 5
zalgo.length; // 16 (5 base + 11 combining)
JavaScript's .length counts UTF-16 code units, not visible characters. Each combining mark in this example is one code unit, so the zalgo string reports a length roughly three times what it appears to have on screen.
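To make the difference between units concrete, here is a small Python sketch that measures the same string in code points, UTF-16 code units, and UTF-8 bytes. The helper name length_report is made up for this illustration; the emoji is included only to show a case where code points and UTF-16 code units disagree.

```python
def length_report(text: str) -> dict:
    """Return the size of `text` in three different units."""
    return {
        # What Python's len() counts.
        "code_points": len(text),
        # What JavaScript's .length counts (2 bytes per UTF-16 code unit).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # What a byte-oriented storage or network layer sees.
        "utf8_bytes": len(text.encode("utf-8")),
    }


zalgo = "H\u0300\u0301\u0302e\u0303\u0304l\u0305\u0306l\u0307\u0308o\u0309\u030A"
emoji = "\U0001F600"  # one code point outside the Basic Multilingual Plane

print(length_report(zalgo))  # code_points 16, utf16_code_units 16, utf8_bytes 27
print(length_report(emoji))  # code_points 1, utf16_code_units 2, utf8_bytes 4
```

None of the three numbers matches the five characters a reader sees, which is why a separate grapheme-level count is needed.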
Grapheme-Aware Length
To get the "visual" length, use the Intl.Segmenter API:
function graphemeLength(str) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}
graphemeLength(clean); // 5
graphemeLength(zalgo); // 5 (same visual length!)
Python
import grapheme  # third-party package: pip install grapheme
clean = "Hello"
# Zalgo version with combining marks
zalgo = "H\u0300\u0301\u0302e\u0303\u0304l\u0305\u0306l\u0307\u0308o\u0309\u030A"
len(clean)  # 5
len(zalgo)  # 16 (len() counts code points: 5 base + 11 combining)
# For grapheme clusters:
grapheme.length(zalgo)  # 5
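When installing a third-party package is not an option, the standard library can give a rough estimate. The sketch below counts every code point that is not a combining mark (Unicode categories Mn, Mc, Me). The function name approx_grapheme_length is hypothetical, and this is an approximation rather than full grapheme segmentation: it would miscount emoji joined with zero-width joiners, Hangul jamo sequences, and text that starts with a stray combining mark.

```python
import unicodedata


def approx_grapheme_length(text: str) -> int:
    """Approximate the visible length by skipping combining marks.

    Assumption: every non-mark code point starts a new visible character.
    This holds for Latin text with stacked diacritics (typical zalgo),
    but not for every script or emoji sequence.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


clean = "Hello"
zalgo = "H\u0300\u0301\u0302e\u0303\u0304l\u0305\u0306l\u0307\u0308o\u0309\u030A"

print(approx_grapheme_length(clean))  # 5
print(approx_grapheme_length(zalgo))  # 5
```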
Practical Implications
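- Character limits: A limit that counts code points rather than visible characters, such as a 280-character tweet limit, is used up quickly by zalgo text
- Database storage: VARCHAR(100) is measured in characters or bytes depending on the database, so it may not hold 100 visible zalgo characters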
- Input validation: Checking input.length <= 50 may reject zalgo text that looks like only 10 characters
- Truncation: Naively truncating at index N may cut in the middle of a grapheme cluster
- Bandwidth: Zalgo text is significantly larger in bytes than its clean equivalent
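One way to address the validation and storage points together is to enforce two limits: one on what the user sees and one on what gets stored. The following is a minimal sketch under assumptions, not a complete validator. The function name check_input and the default limits (50 visible characters, 400 UTF-8 bytes) are invented for illustration, and the visible count uses the same combining-mark approximation rather than full grapheme segmentation.

```python
import unicodedata


def check_input(text: str, max_visible: int = 50, max_bytes: int = 400) -> bool:
    """Accept text only if it fits both a visible and a storage budget."""
    # Approximate visible length: code points that are not combining marks.
    visible = sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )
    # Storage cost: size of the UTF-8 encoding in bytes.
    size = len(text.encode("utf-8"))
    return visible <= max_visible and size <= max_bytes


zalgo = "H\u0300\u0301\u0302e\u0303\u0304l\u0305\u0306l\u0307\u0308o\u0309\u030A"
heavy = "a" + "\u0300" * 500  # one visible character, 1001 bytes

print(check_input(zalgo))     # True: 5 visible characters, 27 bytes
print(check_input(heavy))     # False: fits visually, exceeds the byte budget
print(check_input("a" * 51))  # False: too many visible characters
```

The byte budget keeps a single visible character from carrying an unbounded number of marks, while the visible budget avoids rejecting lightly decorated text that looks short.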
Safe Truncation
function safeTruncate(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  const segments = [...segmenter.segment(str)];
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}
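For comparison with naive slicing, here is a rough Python counterpart that uses only the standard library. The name approx_safe_truncate is hypothetical, and the cluster boundary rule (a new cluster starts at every code point that is not a combining mark) is an assumption that works for stacked diacritics but is not full grapheme segmentation.

```python
import unicodedata


def approx_safe_truncate(text: str, max_graphemes: int) -> str:
    """Keep at most `max_graphemes` base characters plus their marks."""
    if max_graphemes <= 0:
        return ""
    kept = []
    count = 0
    for ch in text:
        if not unicodedata.category(ch).startswith("M"):
            # A non-mark code point starts a new (approximate) cluster.
            count += 1
            if count > max_graphemes:
                break
        kept.append(ch)
    return "".join(kept)


zalgo = "H\u0300\u0301\u0302e\u0303\u0304l\u0305\u0306l\u0307\u0308o\u0309\u030A"

naive = zalgo[:2]                      # "H" plus one mark: cuts the cluster
safe = approx_safe_truncate(zalgo, 2)  # "H" and "e" with all their marks

print(len(naive))  # 2
print(len(safe))   # 7
```

The naive slice keeps only part of the first character's marks, while the cluster-aware version returns two complete visible characters.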
Use Case
Understanding string length behavior with Zalgo text is critical for developers implementing character limits, input validation, database schemas, and text truncation in applications that handle user-generated Unicode content.