Unicode Normalization and Zalgo Text
Understand how Unicode normalization forms (NFC, NFD, NFKC, NFKD) interact with Zalgo text and why normalization alone cannot remove Zalgo combining marks.
Detailed Explanation
Unicode Normalization vs. Zalgo
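A common misconception is that Unicode normalization can fix Zalgo text. In reality, normalization maps equivalent sequences (canonical equivalents, plus compatibility equivalents for NFKC and NFKD) to a single representation. It does not discard excess combining marks.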
The Four Normalization Forms
| Form | Name | Effect |
|---|---|---|
| NFC | Canonical Decomposition + Composition | Composes base + combining into a precomposed character where one exists (e + U+0301 → U+00E9 é) |
| NFD | Canonical Decomposition | Decomposes precomposed characters to base + combining (U+00E9 é → e + U+0301) |
| NFKC | Compatibility Decomposition + Composition | Like NFC, but also replaces compatibility characters (U+FB01 ﬁ → f + i) |
| NFKD | Compatibility Decomposition | Like NFD, but also replaces compatibility characters (U+FB01 ﬁ → f + i) |
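As a rough illustration of the table above, the sketch below runs three sample strings through all four forms using the built-in `String.prototype.normalize`. The `samples` object and the `codePoints` helper are names invented for this example.

```javascript
// Sample inputs (chosen for illustration only).
const samples = {
  precomposed: "\u00E9", // é as a single code point
  decomposed: "e\u0301", // e followed by U+0301 COMBINING ACUTE ACCENT
  ligature: "\uFB01",    // LATIN SMALL LIGATURE FI, a compatibility character
};

// Hypothetical helper: list the code points of a string as U+XXXX.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  for (const [name, value] of Object.entries(samples)) {
    console.log(form, name, codePoints(value.normalize(form)).join(" "));
  }
}
```

The canonical forms (NFC, NFD) leave the ligature untouched, while the compatibility forms (NFKC, NFKD) replace it with plain `f` and `i`.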
Why Normalization Doesn't Remove Zalgo
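Normalization only deals with equivalences: different ways to represent the same character (canonical equivalence) or the same abstract character in a different presentation (compatibility equivalence, NFKC and NFKD only). It does NOT remove combining marks that are "extra":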
const zalgo = "H\u0300\u0301\u0302\u0303\u0304\u0305";
const nfc = zalgo.normalize('NFC');
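// nfc still contains ALL six combining marks.
// In general, normalization may reorder marks by CCC, and NFC may fold one
// mark into the base character, but the remaining marks are never dropped.
// Here every mark has CCC 230 and H has no precomposed form with U+0300,
// so nfc === zalgo.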
What Normalization DOES Do
- Reorders combining marks according to Canonical Combining Class (CCC)
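- Composes base + combining into precomposed form where one exists (NFC, NFKC)
- Decomposes precomposed characters into base + combining (NFD, NFKD)
- Replaces compatibility characters with their plain equivalents (NFKC, NFKD)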
For example, NFC would compose e + \u0301 into \u00E9 (precomposed é), but it would NOT remove additional combining marks beyond that.
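The composition example can be checked with a short sketch. `countMarks` is a helper invented here; it counts characters in General Category Mn using a Unicode property escape, which assumes an engine with ES2018 support.

```javascript
// Hypothetical helper: count nonspacing combining marks (General Category Mn).
function countMarks(text) {
  const matches = text.match(/\p{Mn}/gu);
  return matches === null ? 0 : matches.length;
}

// e + acute + grave + circumflex, all stored as separate code points.
const decomposed = "e\u0301\u0300\u0302";
const composed = decomposed.normalize("NFC");

console.log(countMarks(decomposed)); // 3
console.log(countMarks(composed));   // 2 (only the acute was folded into the base)
console.log(composed.codePointAt(0).toString(16)); // e9
```

NFC absorbs at most the marks that have a precomposed form with the base, so a Zalgo stack of dozens of marks shrinks by very little, if at all.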
Canonical Combining Class (CCC)
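Each combining mark has a CCC value that determines its position in the canonical ordering of a sequence (it does not control rendering). Common values:
- CCC 0: Not reordered (base characters and other starters, plus some marks such as enclosing marks and many spacing marks)
- CCC 1: Overlay marks
- CCC 220: Below marks
- CCC 230: Above marks
During normalization, each run of marks with non-zero CCC is sorted into ascending CCC order. The sort is stable: marks with different CCC values may be reordered, while marks that share a CCC value keep their original relative order, because their order affects how they stack.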
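A minimal sketch of canonical reordering, using NFD so that composition does not obscure the result. Marks with different CCC values are sorted into ascending order, and marks with the same CCC value stay where they were. The `hex` helper is a name made up for this example.

```javascript
// Hypothetical helper: show a string as space-separated hex code points.
function hex(text) {
  return Array.from(text, (ch) =>
    ch.codePointAt(0).toString(16).padStart(4, "0")
  ).join(" ");
}

// U+0301 (acute, CCC 230) typed before U+0323 (dot below, CCC 220).
const mixedClasses = "a\u0301\u0323";
// U+0301 and U+0300 (grave) both have CCC 230.
const sameClass = "a\u0301\u0300";

console.log(hex(mixedClasses.normalize("NFD"))); // 0061 0323 0301 (reordered)
console.log(hex(sameClass.normalize("NFD")));    // 0061 0301 0300 (unchanged)
```

In both cases the number of marks is the same before and after, which is why reordering does nothing to reduce a Zalgo stack.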
The Correct Approach
To remove Zalgo, you must explicitly filter combining marks by Unicode General Category:
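// Remove ALL nonspacing combining marks. Note that this also strips
// legitimate accents when they are stored in decomposed form:
const stripped = text.replace(/\p{Mn}/gu, '');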
// Or limit to a maximum per base character:
// (see "Stripping Zalgo Text" guide)
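One possible shape for the per-character limit is sketched below. It is not taken from the "Stripping Zalgo Text" guide: the function name `limitCombiningMarks`, the default of two marks, and the choice to run NFC first are all assumptions made for illustration.

```javascript
// Hypothetical helper: keep at most `maxMarks` nonspacing marks in each run.
// The default limit of 2 is an assumption; some scripts legitimately stack more.
function limitCombiningMarks(text, maxMarks = 2) {
  return text
    // Compose first so an accent with a precomposed form (e.g. é) becomes
    // part of the base character and does not count toward the limit.
    .normalize("NFC")
    // Each match is a run of consecutive Mn characters; keep only the first few.
    .replace(/\p{Mn}+/gu, (run) => Array.from(run).slice(0, maxMarks).join(""));
}

const zalgoSample = "H\u0300\u0301\u0302\u0303\u0304\u0305";
console.log(Array.from(limitCombiningMarks(zalgoSample)).length); // 3 (H plus two marks)
console.log(limitCombiningMarks("caf\u00E9") === "caf\u00E9");    // true
```

Limiting rather than stripping keeps ordinary accented text intact, at the cost of letting a small amount of decoration through. The pattern could be widened to `\p{M}` to cover enclosing and spacing marks as well, though that risks damaging scripts that rely on spacing marks.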
Use Case
Understanding the interaction between normalization and Zalgo is important for developers who assume normalization will sanitize text input. It is critical for building robust text processing pipelines that handle adversarial Unicode input correctly.