Zero-Width Characters and Invisible String Length
Explore zero-width joiners, zero-width spaces, and other invisible Unicode characters that add to string length without being visible to the user.
Detailed Explanation
Zero-Width Characters: Invisible But Counted
Several Unicode characters have zero visual width but still count toward string length. These can cause subtle bugs in validation, comparison, and storage.
Common Zero-Width Characters
| Character | Code Point | UTF-8 Bytes | Purpose |
|---|---|---|---|
| Zero Width Space (ZWSP) | U+200B | 3 | Word break hint |
| Zero Width Joiner (ZWJ) | U+200D | 3 | Joins emoji sequences |
| Zero Width Non-Joiner (ZWNJ) | U+200C | 3 | Prevents ligatures |
| Word Joiner (WJ) | U+2060 | 3 | Prevents line break |
| Soft Hyphen | U+00AD | 2 | Optional hyphenation point |
| BOM (Byte Order Mark) | U+FEFF | 3 | Encoding signature at the start of a file; elsewhere a zero-width no-break space |
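The sizes in the table can be checked directly. The following is a minimal sketch, assuming a runtime with a global `TextEncoder` (modern browsers and Node.js); the `invisibles` map and the `measure` helper are illustrative names, not part of any library.

```javascript
// Characters from the table, written as escapes so they stay visible in source.
const invisibles = {
  ZWSP: "\u200B",
  ZWNJ: "\u200C",
  ZWJ: "\u200D",
  WJ: "\u2060",
  SHY: "\u00AD",
  BOM: "\uFEFF",
};

const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: UTF-16 code units (what .length reports) and UTF-8 bytes.
function measure(ch) {
  return { codeUnits: ch.length, utf8Bytes: encoder.encode(ch).length };
}

for (const [name, ch] of Object.entries(invisibles)) {
  const { codeUnits, utf8Bytes } = measure(ch);
  console.log(`${name}: ${codeUnits} code unit(s), ${utf8Bytes} UTF-8 byte(s)`);
}
```

Every entry reports one code unit, because all six characters sit in the Basic Multilingual Plane; only the soft hyphen fits in two UTF-8 bytes.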
The ZWJ in Emoji
The Zero Width Joiner (U+200D) is what makes complex emoji possible:
👩 + ZWJ + 🚀 = 👩🚀 (woman astronaut)
Each ZWJ adds 3 UTF-8 bytes and 1 UTF-16 code unit (the unit that JavaScript's `.length` counts) to the string, but no visible width. A family emoji with 3 ZWJs therefore carries 9 invisible bytes.
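To make the arithmetic concrete, here is a rough sketch that builds a four-person family emoji from escapes and counts it four different ways. It assumes `TextEncoder` and `Intl.Segmenter` are available (recent browsers, Node.js 16 or later); `lengths` is a hypothetical helper name.

```javascript
const ZWJ = "\u200D";

// man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = ["\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].join(ZWJ);

// Hypothetical helper: the same string measured in four units.
function lengths(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                            // UTF-16 code units
    codePoints: [...str].length,                      // Unicode code points
    utf8Bytes: new TextEncoder().encode(str).length,  // bytes when stored as UTF-8
    graphemes: [...segmenter.segment(str)].length,    // user-perceived characters
    zwjCount: str.split(ZWJ).length - 1,              // invisible joiners
  };
}

console.log(lengths(family));
// codeUnits: 11, codePoints: 7, utf8Bytes: 25, zwjCount: 3
```

Of the 25 bytes, 16 belong to the four visible emoji and 9 to the three joiners. The grapheme count should come out as 1 on engines with current Unicode data, which is why a single visible symbol can fail a length limit.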
Hidden Text Attacks
Zero-width characters can be used to:
- Bypass filters: Insert a ZWSP inside a banned word so that exact-match filters no longer detect it
- Watermark text: Embed invisible patterns to track copy-paste
- Break validation: A seemingly empty input that has non-zero length
"".length // 0
"\u200B".length // 1 (looks empty but isn't!)
"\u200B\u200C\u200D".length // 3 (three invisible characters)
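The filter-bypass and validation cases above can be guarded against by stripping invisible characters before checking the input. This is a sketch under stated assumptions: the character list mirrors the table in this article and is not exhaustive, and `isBlank` and `containsBanned` are hypothetical helper names.

```javascript
// Invisible characters from the table; illustrative, not a complete list.
const INVISIBLE = /[\u00AD\u200B\u200C\u200D\u2060\uFEFF]/g;

// True when nothing visible remains after removing invisible characters
// and ordinary whitespace.
function isBlank(input) {
  return input.replace(INVISIBLE, "").trim().length === 0;
}

// Case-insensitive substring match performed on the stripped text, so a
// ZWSP hidden inside a word no longer defeats the check.
function containsBanned(text, bannedWords) {
  const normalized = text.replace(INVISIBLE, "").toLowerCase();
  return bannedWords.some((word) => normalized.includes(word.toLowerCase()));
}

console.log(isBlank("\u200B"));                      // true
console.log("sp\u200Bam".includes("spam"));          // false: naive check is bypassed
console.log(containsBanned("sp\u200Bam", ["spam"])); // true
```

Stripping happens only inside the checks; the original input is left untouched, which matters when the text legitimately contains ZWJ emoji.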
Detection and Removal
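// Detect zero-width characters (the pattern also covers the soft hyphen, U+00AD)
const hasZeroWidth = /[\u00AD\u200B\u200C\u200D\u2060\uFEFF]/.test(str);
// Remove zero-width characters. Note: stripping U+200D also splits ZWJ emoji into their parts.
const clean = str.replace(/[\u00AD\u200B\u200C\u200D\u2060\uFEFF]/g, "");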
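When a boolean is not enough, it can help to report where each invisible character sits. The sketch below assumes the same set of characters as the table in this article; `INVISIBLE_NAMES` and `findInvisible` are hypothetical names, and the reported index is a UTF-16 code unit offset.

```javascript
// Code point to Unicode character name, for the characters in the table.
const INVISIBLE_NAMES = new Map([
  [0x00ad, "SOFT HYPHEN"],
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200c, "ZERO WIDTH NON-JOINER"],
  [0x200d, "ZERO WIDTH JOINER"],
  [0x2060, "WORD JOINER"],
  [0xfeff, "ZERO WIDTH NO-BREAK SPACE"],
]);

// Returns one entry per invisible character found in the string.
function findInvisible(str) {
  const hits = [];
  // All listed characters are in the BMP, so scanning code units is sufficient.
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (INVISIBLE_NAMES.has(code)) {
      hits.push({
        index: i,
        codePoint: "U+" + code.toString(16).toUpperCase().padStart(4, "0"),
        name: INVISIBLE_NAMES.get(code),
      });
    }
  }
  return hits;
}

console.log(findInvisible("hel\u200Blo"));
// [ { index: 3, codePoint: "U+200B", name: "ZERO WIDTH SPACE" } ]
```

A report like this is useful in logs or moderation tooling, where the decision to strip, reject, or allow a character depends on which one it is.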
Impact on String Comparison
Two strings that look identical may differ in zero-width characters:
"hello" === "hel\u200Blo" // false!
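One way to compare strings by what the user sees is to strip invisible characters from both sides first. This is only a sketch: `stripInvisible` and `equalIgnoringInvisible` are hypothetical helpers, and the character list is the one from this article rather than a complete inventory.

```javascript
const INVISIBLE_CHARS = /[\u00AD\u200B\u200C\u200D\u2060\uFEFF]/g;

function stripInvisible(str) {
  return str.replace(INVISIBLE_CHARS, "");
}

// Compares two strings after removing invisible characters from each.
function equalIgnoringInvisible(a, b) {
  return stripInvisible(a) === stripInvisible(b);
}

console.log("hello" === "hel\u200Blo");                      // false
console.log(equalIgnoringInvisible("hello", "hel\u200Blo")); // true
```

This kind of comparison suits identifiers such as usernames or tags. It is less suitable for text with emoji or scripts that rely on joiners, where removing U+200D or U+200C changes what is displayed.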
The String Length Calculator's grapheme breakdown helps identify these invisible characters by showing the exact code points for each position.
Use Case
When building input sanitization, content moderation, or anti-spam systems, detecting zero-width characters helps prevent filter bypass, hidden text attacks, and invisible string length inflation that can cause unexpected database or API errors.