Unicode Normalization in Databases
Learn how to handle Unicode normalization in PostgreSQL, MySQL, SQLite, and MongoDB. Understand why storing normalized text prevents data inconsistency and search failures.
Detailed Explanation
Normalization in Databases
Databases store text as bytes, and most databases do not automatically normalize Unicode on insert. This means the same visual text can be stored in different byte sequences, causing equality checks and unique constraints to behave unexpectedly.
The Duplicate Problem
Without normalization, a UNIQUE constraint on a username column might allow both:
- café (U+00E9, precomposed, NFC form)
- café (U+0065 + U+0301, 'e' plus combining acute accent, NFD form)
These look identical to users but are different byte sequences.
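The mismatch is easy to demonstrate in JavaScript, where String.prototype.normalize() exposes the Unicode normalization forms:

```javascript
// Two visually identical strings in different Unicode forms.
const nfc = "caf\u00E9";   // 'é' as a single code point (U+00E9)
const nfd = "cafe\u0301";  // 'e' + combining acute accent (U+0065 U+0301)

console.log(nfc === nfd);                                    // false: different code points
console.log(nfc.normalize("NFC") === nfd.normalize("NFC"));  // true: same form after NFC
```

A unique index built on raw bytes sees these as two distinct values, which is exactly how the invisible duplicates arise.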
PostgreSQL
PostgreSQL has supported ICU collations since version 10, and since version 12 a collation can be declared nondeterministic, which makes canonically equivalent strings (NFC vs. NFD) compare as equal:
CREATE COLLATION nfc_insensitive (
provider = icu,
locale = 'und',
deterministic = false
);
Note that nondeterministic collations cannot be used with pattern matching (LIKE, regular expressions).
For explicit normalization, use normalize() (PostgreSQL 13+):
SELECT normalize(U&'\00E9', NFC);       -- precomposed 'é' (NFC input)
SELECT normalize(U&'\0065\0301', NFC);  -- 'e' + combining acute (NFD input)
-- Both return 'é'
MySQL
MySQL and MariaDB handle comparison through collations but have no built-in NORMALIZE() function, so rows are stored with whatever byte sequence the client sent. Best practice is to normalize in the application layer before INSERT:
-- Application normalizes to NFC before this INSERT
INSERT INTO users (username) VALUES ('café');
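A minimal sketch of that application-layer step in Node.js; here `db.execute` is a placeholder for any parameterized MySQL client call (e.g. mysql2), not a real handle:

```javascript
// Normalize to NFC at the write boundary so the database only ever
// sees one byte sequence per visual string.
function toStorageForm(text) {
  return text.normalize("NFC");
}

// Hypothetical parameterized insert through a MySQL client:
// await db.execute("INSERT INTO users (username) VALUES (?)", [toStorageForm(input)]);

console.log(toStorageForm("cafe\u0301") === "caf\u00E9");  // true: NFD input composed to NFC
```

Because NFC is idempotent, applying it on every write is safe even for input that is already normalized.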
MongoDB
MongoDB compares strings byte-by-byte by default. Use a collation with strength: 1 (primary strength, which compares base characters only, ignoring accents and case) on queries or indexes, or normalize in the application layer.
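What strength: 1 means can be sketched in plain JavaScript: Intl.Collator with sensitivity: "base" similarly compares base characters only, ignoring accents and case (a rough analogue, not MongoDB's implementation):

```javascript
// sensitivity "base" ignores accents and case, roughly MongoDB's strength: 1
const collator = new Intl.Collator("en", { sensitivity: "base" });

console.log(collator.compare("café", "cafe"));  // 0: accents ignored, treated as equal
console.log(collator.compare("café", "CAFÉ"));  // 0: case ignored as well
```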
Best Practice: Normalize on Write
The safest approach is to normalize all text to NFC before writing to the database. This ensures consistency regardless of the source:
const normalized = userInput.normalize("NFC");
await db.insert({ name: normalized });
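A minimal sketch of how write-time NFC normalization keeps a uniqueness check honest, using an in-memory Set as a stand-in for a database unique index (the Set and function names are illustrative):

```javascript
// In-memory stand-in for a UNIQUE index over normalized usernames.
const usernames = new Set();

function register(name) {
  const key = name.normalize("NFC");      // normalize on write
  if (usernames.has(key)) return false;   // duplicate: reject
  usernames.add(key);
  return true;
}

console.log(register("caf\u00E9"));   // true: first insert succeeds
console.log(register("cafe\u0301"));  // false: NFD variant is the same name
```

Without the normalize() call, both spellings would be admitted, reproducing the invisible-duplicate problem inside the database.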
Use Case
Critical for any application storing user-generated text in a database, especially names, usernames, email addresses, and search terms. Without normalization, databases can contain invisible duplicates that bypass unique constraints and cause inconsistent query results.