Unicode Normalization in Databases

Learn how to handle Unicode normalization in PostgreSQL, MySQL, SQLite, and MongoDB. Understand why storing normalized text prevents data inconsistency and search failures.

Use Cases

Detailed Explanation

Normalization in Databases

Databases store text as bytes, and most databases do not automatically normalize Unicode on insert. This means the same visual text can be stored in different byte sequences, causing equality checks and unique constraints to behave unexpectedly.

The Duplicate Problem

Without normalization, a UNIQUE constraint on a username column might allow both:

  • café (U+00E9, NFC form)
  • café (U+0065 + U+0301, NFD form)

These look identical to users but are different byte sequences.

PostgreSQL

PostgreSQL supports ICU collations (since version 10) that handle normalization-aware comparison:

CREATE COLLATION nfc_insensitive (
  provider = icu,
  locale = 'und-u-kn-true'
);

For explicit normalization, use normalize() (PostgreSQL 13+):

SELECT normalize('é', NFC);
SELECT normalize('é', NFC);
-- Both return 'é'

MySQL

MySQL/MariaDB uses collations for comparison but does not have a built-in NORMALIZE() function. Best practice is to normalize in the application layer before INSERT:

-- Application normalizes to NFC before this INSERT
INSERT INTO users (username) VALUES ('café');

MongoDB

MongoDB string comparison is byte-by-byte by default. Use a collation with strength: 1 for accent-insensitive comparison, or normalize in the application layer.

Best Practice: Normalize on Write

The safest approach is to normalize all text to NFC before writing to the database. This ensures consistency regardless of the source:

const normalized = userInput.normalize("NFC");
await db.insert({ name: normalized });

Use Case

Critical for any application storing user-generated text in a database, especially names, usernames, email addresses, and search terms. Without normalization, databases can contain invisible duplicates that bypass unique constraints and cause inconsistent query results.

Try It — Unicode Normalizer

Open full tool