Unicode Normalization and Zalgo Text

Understand how Unicode normalization forms (NFC, NFD, NFKC, NFKD) interact with Zalgo text and why normalization alone cannot remove Zalgo combining marks.


Detailed Explanation

Unicode Normalization vs. Zalgo

A common misconception is that Unicode normalization can fix Zalgo text. In reality, normalization forms handle canonical equivalences, not excess combining marks.

The Four Normalization Forms

Form   Name                                        Effect
NFC    Canonical Decomposition + Composition       Composes base + mark sequences into precomposed characters (e + U+0301 → é, U+00E9)
NFD    Canonical Decomposition                     Decomposes into base + combining marks (é, U+00E9 → e + U+0301)
NFKC   Compatibility Decomposition + Composition   Like NFC, but also maps compatibility characters
NFKD   Compatibility Decomposition                 Like NFD, but also maps compatibility characters
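The differences are easy to observe by comparing code point counts and the treatment of a compatibility character such as the "ﬁ" ligature (U+FB01) — a small sketch:

```javascript
// "é" as a single precomposed code point U+00E9
const precomposed = "\u00E9";
// "é" as base "e" + combining acute accent U+0301
const decomposed = "e\u0301";

console.log([...precomposed.normalize('NFD')].length); // 2: NFD splits into e + U+0301
console.log([...decomposed.normalize('NFC')].length);  // 1: NFC composes into U+00E9

// Compatibility forms additionally transform "compatibility" characters,
// e.g. the ligature "ﬁ" (U+FB01) becomes plain "fi" under NFKC/NFKD:
console.log("\uFB01".normalize('NFC'));  // "ﬁ" — unchanged (canonical forms only)
console.log("\uFB01".normalize('NFKC')); // "fi" — compatibility mapping applied
```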

Why Normalization Doesn't Remove Zalgo

Normalization only deals with canonical equivalences — different ways to represent the same character. It does NOT remove combining marks that are "extra":

const zalgo = "H\u0300\u0301\u0302\u0303\u0304\u0305"; // "H" + 6 combining marks
const nfc = zalgo.normalize('NFC');
// nfc.length is still 7 — every combining mark survives
// Normalization may reorder them by CCC, but will never remove any

What Normalization DOES Do

  1. Reorders combining marks according to Canonical Combining Class (CCC)
  2. Composes base + combining into precomposed form where one exists
  3. Decomposes precomposed characters into base + combining

For example, NFC would compose e + \u0301 into \u00E9 (precomposed é), but it would NOT remove additional combining marks beyond that.
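This is easy to verify — a small sketch:

```javascript
// Base "e" + combining acute (U+0301) + two extra Zalgo-style marks
const input = "e\u0301\u0302\u0303";
const nfc = input.normalize('NFC');

// NFC composes e + U+0301 into precomposed é (U+00E9)...
console.log(nfc.codePointAt(0).toString(16)); // "e9"
// ...but the remaining combining marks are left untouched
console.log(nfc.length); // 3: é + U+0302 + U+0303
```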

Canonical Combining Class (CCC)

Each combining mark has a CCC value that determines its position in canonical ordering:

  • CCC 0: starters, including base characters and spacing marks (never reordered)
  • CCC 1: overlay marks
  • CCC 220: marks rendered below the base (e.g. U+0323 combining dot below)
  • CCC 230: marks rendered above the base (e.g. U+0301 combining acute)

Marks with different CCC values are reordered into ascending CCC order during normalization. Marks with the same CCC value keep their relative order — the sort is stable, because their sequence can be visually significant.
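The reordering is observable directly: a circumflex (U+0302, CCC 230) typed before a dot below (U+0323, CCC 220) is swapped into ascending CCC order — a small sketch:

```javascript
// Circumflex above (CCC 230) written before dot below (CCC 220):
const unordered = "o\u0302\u0323";
const nfd = unordered.normalize('NFD');

// NFD sorts the marks into ascending CCC order: dot below first
console.log(nfd === "o\u0323\u0302"); // true
// Both marks survive — reordering never removes anything
console.log([...nfd].length); // 3
```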

The Correct Approach

To remove Zalgo, you must explicitly filter combining marks by Unicode General Category:

// Remove ALL combining marks:
text.replace(/\p{Mn}/gu, '');

// Or limit to a maximum per base character:
// (see "Stripping Zalgo Text" guide)
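Both approaches can be sketched as below. This is an illustration, not the referenced guide's implementation; the function names and the cap of 2 marks per base character are arbitrary choices for demonstration:

```javascript
// Remove every nonspacing combining mark (\p{Mn}).
// NFD first, so precomposed characters like é lose only their marks,
// not the base letter — assumes losing legitimate accents is acceptable.
function stripAllMarks(text) {
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// Or cap the number of marks per base character (2 here, arbitrarily):
function limitMarks(text, max = 2) {
  let count = 0;
  let out = '';
  for (const ch of text) {        // iterates by code point
    if (/\p{Mn}/u.test(ch)) {
      if (count < max) { out += ch; count++; } // keep up to `max` marks
    } else {
      count = 0;                  // a new base character resets the budget
      out += ch;
    }
  }
  return out;
}
```

The per-base cap preserves legitimately accented text (one or two marks) while discarding the long mark stacks characteristic of Zalgo.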

Use Case

Understanding how normalization interacts with Zalgo matters because developers often assume normalization will sanitize text input. It is critical for building robust text-processing pipelines that handle adversarial Unicode input correctly.
