Unicode Normalization and Zalgo Text

Understand how Unicode normalization forms (NFC, NFD, NFKC, NFKD) interact with Zalgo text and why normalization alone cannot remove Zalgo combining marks.

Detailed Explanation

Unicode Normalization vs. Zalgo

A common misconception is that Unicode normalization can fix Zalgo text. In reality, the normalization forms resolve canonical and compatibility equivalences; they do not strip excess combining marks.

The Four Normalization Forms

Form	Name	Effect
NFC	Canonical Decomposition + Canonical Composition	Composes base + combining into precomposed characters (e + U+0301 → é)
NFD	Canonical Decomposition	Decomposes precomposed characters into base + combining (é → e + U+0301)
NFKC	Compatibility Decomposition + Canonical Composition	Like NFC, but also replaces compatibility characters (ﬁ → fi)
NFKD	Compatibility Decomposition	Like NFD, but also replaces compatibility characters (ﬁ → fi)
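
As a quick illustration of the table, the sketch below runs the built-in `String.prototype.normalize` on a precomposed é and on the ﬁ ligature. The variable names are made up for this example.

```javascript
// Sample inputs: a precomposed character and a compatibility character.
const precomposed = "\u00E9"; // é as a single code point
const ligature = "\uFB01";    // ﬁ ligature

const formResults = {
  // NFD splits é into e + U+0301, so the string grows to 2 code units.
  nfdLength: precomposed.normalize("NFD").length,
  // NFC joins e + U+0301 back into one code point.
  nfcLength: "e\u0301".normalize("NFC").length,
  // Canonical forms leave the ligature alone.
  nfcLigature: ligature.normalize("NFC"),
  // Compatibility forms replace it with the plain letters "fi".
  nfkcLigature: ligature.normalize("NFKC"),
};

console.log(formResults);
```

None of the four forms has a step that discards marks, which is why the next section matters.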

Why Normalization Doesn't Remove Zalgo

Normalization only maps between equivalent representations of the same text, i.e. different ways to encode the same character sequence. It does NOT remove combining marks that are "extra":

const zalgo = "H\u0300\u0301\u0302\u0303\u0304\u0305";
const nfc = zalgo.normalize('NFC');
// nfc still contains ALL six combining marks.
// All six have CCC 230, so they are not even reordered; with mixed CCC values
// normalization could reorder the marks, but it still would not drop any.
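
To check this claim across all four forms, one option is to count the mark code points before and after normalizing. This is a minimal sketch; `countMarks` is a hypothetical helper, not part of any library.

```javascript
// Count code points in the Mark categories (Mn, Mc, Me).
function countMarks(text) {
  return (text.match(/\p{M}/gu) || []).length;
}

const zalgoSample = "H\u0300\u0301\u0302\u0303\u0304\u0305";

// Normalize with every form and record how many marks survive.
const markCounts = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  markCounts[form] = countMarks(zalgoSample.normalize(form));
}

console.log(markCounts); // every form reports 6
```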

What Normalization DOES Do

Reorders combining marks according to Canonical Combining Class (CCC)
Composes base + combining into precomposed form where one exists
Decomposes precomposed characters into base + combining

For example, NFC would compose e + \u0301 into \u00E9 (precomposed é), but it would NOT remove additional combining marks beyond that.
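
The composition case can be sketched as follows, assuming a sample string of e followed by three stacked accents. NFC folds the first accent into the precomposed é and leaves the other two in place.

```javascript
// e + acute + grave + circumflex
const stacked = "e\u0301\u0300\u0302";
const composed = stacked.normalize("NFC");

// List the resulting code points in hex for inspection.
const composedCodePoints = Array.from(composed, (ch) =>
  ch.codePointAt(0).toString(16)
);

console.log(composedCodePoints); // e9, 300, 302
```

The accent absorbed into é is still part of the text, so the string is shorter by one code point but visually unchanged.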

Canonical Combining Class (CCC)

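Each combining mark has a CCC value that determines its position in canonical order (not how it is rendered):

CCC 0: Starters, i.e. base characters and some marks such as most spacing and enclosing marks (never reordered)
CCC 1: Overlay marks
CCC 220: Below marks
CCC 230: Above marks

Within a run of marks with non-zero CCC, normalization sorts the marks by ascending CCC value. Marks with the same CCC value keep their relative order, because that order can affect rendering and swapping them would change the text.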
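
The ordering rule can be observed directly. The sketch below uses U+0301 (acute, CCC 230) and U+0323 (dot below, CCC 220); `toCodePoints` is a hypothetical helper for display only.

```javascript
// Format a string as a list of U+XXXX code points.
function toCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// Different CCC values: the below mark (220) is moved before the above mark (230).
const mixedClasses = "a\u0301\u0323";
const reordered = mixedClasses.normalize("NFD");
console.log(toCodePoints(reordered)); // U+0061 U+0323 U+0301

// Same CCC value (230): the original order is preserved.
const sameClass = "a\u0301\u0300";
const unchanged = sameClass.normalize("NFD");
console.log(toCodePoints(unchanged)); // U+0061 U+0301 U+0300
```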

The Correct Approach

To remove Zalgo, you must explicitly filter combining marks by Unicode General Category:

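// Remove ALL nonspacing marks. This also strips legitimate accents from decomposed text.
// Use \p{M} instead to cover enclosing (Me) and spacing (Mc) marks as well.
const stripped = text.replace(/\p{Mn}/gu, '');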

// Or limit to a maximum per base character:
// (see "Stripping Zalgo Text" guide)
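
One possible shape for the per-character limit is sketched below. It is an illustration under assumptions, not the implementation from the referenced guide: `limitCombiningMarks` and its default of two marks are hypothetical choices.

```javascript
// Keep at most maxMarks combining marks after each base character.
function limitCombiningMarks(text, maxMarks = 2) {
  // Decompose first so accents inside precomposed characters are counted too.
  const decomposed = text.normalize("NFD");
  // Trim every run of consecutive marks down to its first maxMarks entries.
  const limited = decomposed.replace(/\p{M}+/gu, (run) =>
    Array.from(run).slice(0, maxMarks).join("")
  );
  // Recompose so ordinary accented letters return to their usual form.
  return limited.normalize("NFC");
}

const cleaned = limitCombiningMarks("H\u0300\u0301\u0302\u0303\u0304\u0305i");
console.log(cleaned.length); // 4: H, two marks, i
```

Unlike stripping every mark, this keeps ordinary accented text such as "café" intact while cutting the tall stacks that make Zalgo text overflow its line.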

Use Case

Understanding the interaction between normalization and Zalgo is important for developers who assume normalization will sanitize text input. It is critical for building robust text processing pipelines that handle adversarial Unicode input correctly.
