Unicode Normalization for Text Comparison
Learn how to properly compare Unicode text strings using normalization. Avoid subtle bugs where visually identical strings fail equality checks due to different code point representations.
Detailed Explanation
Normalizing for Correct Text Comparison
String comparison is the most common reason to use Unicode normalization. Without it, two visually identical strings can fail an equality check.
The Problem
const a = "é"; // U+00E9 (precomposed)
const b = "é"; // U+0065 + U+0301 (decomposed)
a === b; // false!
a.length === b.length; // false! (1 vs 2)
Both a and b display as "é", but they contain different code point sequences, so strict equality and length checks disagree.
The Solution
a.normalize("NFC") === b.normalize("NFC"); // true
a.normalize("NFD") === b.normalize("NFD"); // true
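The comparison can be wrapped in a small helper so callers never forget the normalize step (a minimal sketch; the name canonicalEquals is illustrative, not a standard API):

```javascript
// Compare two strings for canonical equivalence.
// NFC and NFD give the same verdict here; NFC is used for consistency.
function canonicalEquals(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const precomposed = "\u00E9";  // "é" as a single code point
const decomposed = "e\u0301";  // "e" + combining acute accent
canonicalEquals(precomposed, decomposed); // true
```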
Choosing the Right Form for Comparison
| Scenario | Recommended Form |
|---|---|
| Exact text comparison | NFC or NFD (either works, be consistent) |
| Search / indexing | NFKC (treats compatibility characters as equal) |
| Username comparison | NFKC + case fold |
| File path comparison | NFC (cross-platform safe) |
| Cryptographic hashing | NFC (canonical, compact) |
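For the search and username rows, NFKC additionally folds compatibility characters such as ligatures and fullwidth letters; toLowerCase() is used below as an approximation of case folding (a sketch; the name canonicalUsername is illustrative):

```javascript
// NFKC expands compatibility characters (ligatures, fullwidth forms),
// then toLowerCase() approximates case folding for most scripts.
function canonicalUsername(name) {
  return name.normalize("NFKC").toLowerCase();
}

canonicalUsername("\uFB01le");  // "file" — the "ﬁ" ligature is expanded
canonicalUsername("\uFF21\uFF24\uFF2D\uFF29\uFF2E"); // "admin" — fullwidth "ＡＤＭＩＮ" is folded
```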
Sorting and Collation
Normalizing before sorting ensures that equivalent representations of accented characters sort together rather than landing in different positions:
const names = ["Zo\u00EB", "Zoe\u0308"]; // precomposed "ë" vs "e" + U+0308
// Without normalization, these are distinct strings and may sort apart
// With normalization, they are treated as identical
const sorted = names
.map(n => n.normalize("NFC"))
.sort((a, b) => a.localeCompare(b));
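The same normalization step also makes duplicate removal work; a short sketch continuing the idea above, with the mixed representations spelled out as escapes:

```javascript
// The same name twice: precomposed "ë" (U+00EB) vs decomposed "e" + U+0308.
const names = ["Zo\u00EB", "Zoe\u0308"];

// Normalizing before inserting into a Set collapses them to one entry.
const unique = [...new Set(names.map(n => n.normalize("NFC")))];
unique.length; // 1
```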
Hash-Based Comparison
If you are using hash-based data structures (hash maps, sets, checksums), normalization is critical:
const set = new Set();
set.add("café".normalize("NFC"));
set.has("café".normalize("NFC")); // true
// Without normalization, mixed representations miss:
const set2 = new Set();
set2.add("café"); // U+0065 + U+0301 (decomposed)
set2.has("café"); // false! (looked up with precomposed U+00E9)
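One way to avoid scattering normalize() calls across the codebase is to normalize keys inside a small wrapper (a sketch; NormalizedSet is a hypothetical name, not a built-in):

```javascript
// A Set that normalizes every value to NFC on the way in and on lookup.
class NormalizedSet {
  #inner = new Set();
  add(value) { this.#inner.add(value.normalize("NFC")); return this; }
  has(value) { return this.#inner.has(value.normalize("NFC")); }
  get size() { return this.#inner.size; }
}

const tags = new NormalizedSet();
tags.add("cafe\u0301");  // decomposed "café"
tags.has("caf\u00E9");   // true — precomposed lookup still matches
```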
Performance Tip
Normalize once at the point of input (form submission, file read, API response), not repeatedly at each comparison. Store the normalized form.
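A sketch of that pattern at an input boundary (the function name sanitizeInput is illustrative):

```javascript
// Normalize user-supplied text once, when it enters the system,
// and store the normalized form.
function sanitizeInput(raw) {
  return raw.trim().normalize("NFC");
}

// Later comparisons can then use plain ===, with no repeated
// normalization cost on the hot path.
const stored = sanitizeInput("cafe\u0301 "); // decomposed input with trailing space
stored === "caf\u00E9"; // true
```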
Use Case
Fundamental for any application comparing text: authentication systems verifying passwords and usernames, duplicate detection in databases, spell checkers, autocomplete systems, and test frameworks comparing expected vs actual output.