Unicode Normalization for Text Comparison

Learn how to properly compare Unicode text strings using normalization. Avoid subtle bugs where visually identical strings fail equality checks due to different code point representations.

Detailed Explanation

Normalizing for Correct Text Comparison

String comparison is the most common reason to use Unicode normalization. Without it, two visually identical strings can fail an equality check.

The Problem

const a = "\u00E9";   // é as U+00E9 (precomposed)
const b = "e\u0301";  // é as U+0065 + U+0301 (decomposed)

a === b;                  // false!
a.length === b.length;    // false! (1 vs 2)

Both a and b display as "é", but they are different code point sequences, so strict equality fails.

The Solution

a.normalize("NFC") === b.normalize("NFC");  // true
a.normalize("NFD") === b.normalize("NFD");  // true
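
The two calls above can be wrapped in a small helper so that every comparison goes through the same form. This is a minimal sketch: equalsNormalized is a hypothetical name, not a built-in, and NFC is used only as a default.

```javascript
// Hypothetical helper: compares two strings after normalizing both to the same form.
function equalsNormalized(x, y, form = "NFC") {
  return x.normalize(form) === y.normalize(form);
}

const precomposed = "\u00E9";  // é as one code point
const decomposed = "e\u0301";  // e followed by a combining acute accent

console.log(precomposed === decomposed);                 // false
console.log(equalsNormalized(precomposed, decomposed));  // true
```

Passing "NFKC" as the third argument would also treat compatibility characters, such as the "ﬁ" ligature and the letters "fi", as equal.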

Choosing the Right Form for Comparison

Scenario                 Recommended Form
Exact text comparison    NFC or NFD (either works, be consistent)
Search / indexing        NFKC (treats compatibility characters as equal)
Username comparison      NFKC + case fold
File path comparison     NFC (cross-platform safe)
Cryptographic hashing    NFC (canonical, compact)
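
As a rough illustration of the username case, the sketch below builds a comparison key with NFKC plus lowercasing. usernameKey is a hypothetical helper, and toLowerCase() is an assumption standing in for Unicode case folding, which JavaScript does not expose directly.

```javascript
// Hypothetical helper: builds a comparison key for usernames.
// Assumption: toLowerCase() approximates case folding. The two differ for a
// few characters (for example, full case folding maps "ß" to "ss", while
// toLowerCase() leaves it unchanged).
function usernameKey(name) {
  return name.normalize("NFKC").toLowerCase();
}

const fullwidth = "\uFF21\uFF4C\uFF49\uFF43\uFF45";  // "Alice" written in fullwidth letters

console.log(usernameKey(fullwidth) === usernameKey("alice"));      // true
console.log(usernameKey("\uFB01ona") === usernameKey("FIONA"));    // true ("ﬁ" ligature expands to "fi")
```

Store and compare the key, and keep the original string separately if it needs to be displayed as the user typed it.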

Sorting and Collation

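Normalization keeps canonically equivalent strings together when sorting. A plain code-unit sort places precomposed and decomposed spellings at different positions; normalizing first, then sorting with localeCompare, gives a consistent, locale-aware order: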

const names = ["Zo\u00EB", "Zoe\u0308"];  // precomposed and decomposed "Zoë"
// Without normalization, a code-unit sort can place these apart
// With normalization, they are identical strings
const sorted = names
  .map(n => n.normalize("NFC"))
  .sort((a, b) => a.localeCompare(b));
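
To make the difference visible, the following sketch sorts a small made-up list with the default code-unit sort, with and without normalization. The sample names are assumptions chosen so that an unrelated entry lands between the two spellings.

```javascript
const raw = ["Zoe\u0308", "Zof", "Zo\u00EB"];  // decomposed Zoë, Zof, precomposed Zoë

// The default sort compares UTF-16 code units: "e" < "f" < "ë",
// so "Zof" ends up between the two spellings of Zoë.
const byCodeUnit = [...raw].sort();

// After NFC normalization both spellings are the same string,
// so they sort next to each other.
const normalizedSorted = raw.map(s => s.normalize("NFC")).sort();

console.log(byCodeUnit[1]);                                // "Zof"
console.log(normalizedSorted[1] === normalizedSorted[2]);  // true
```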

Hash-Based Comparison

If you are using hash-based data structures (hash maps, sets, checksums), normalization is critical:

const set = new Set();
set.add("caf\u00E9".normalize("NFC"));   // precomposed é
set.has("cafe\u0301".normalize("NFC"));  // true (decomposed é, normalized)

// Without normalization:
const set2 = new Set();
set2.add("caf\u00E9");
set2.has("cafe\u0301");  // false!

Performance Tip

Normalize once at the point of input (form submission, file read, API response), not repeatedly at each comparison. Store the normalized form.
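
One way to apply this is to normalize at the boundary where text enters the system. The sketch below is illustrative only: usersByName, registerUser, and findUser are hypothetical names, and the Map stands in for whatever storage the application actually uses.

```javascript
// Hypothetical in-memory store standing in for a database table.
const usersByName = new Map();

// Normalize once, when the value enters the system, and store that form.
function registerUser(rawName) {
  const name = rawName.normalize("NFC");
  usersByName.set(name, { name });
  return name;
}

// Lookups normalize their input at the same boundary, so stored keys
// never need to be re-normalized.
function findUser(rawQuery) {
  return usersByName.get(rawQuery.normalize("NFC"));
}

registerUser("Zoe\u0308");                         // stored as precomposed "Zoë"
console.log(findUser("Zo\u00EB") !== undefined);   // true
```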

Use Case

Fundamental for any application comparing text: authentication systems verifying passwords and usernames, duplicate detection in databases, spell checkers, autocomplete systems, and test frameworks comparing expected vs actual output.
