Unicode Normalization for Search and Indexing

Learn how to apply Unicode normalization to improve search accuracy. Understand why NFKC is preferred for search indexes and how to handle accented characters in queries.

Detailed Explanation

Normalization for Search

Search engines and text indexing systems must handle the reality that users type text in many different ways. Unicode normalization is a critical preprocessing step for accurate search.

The Problem Without Normalization

Consider a database containing the name "café" stored as café (NFC). A user searches for "café" but their system sends café (NFD). Without normalization:

"café" (stored, NFC)  ≠  "café" (query, NFD)

The search fails even though the text is visually identical.
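This failure mode can be reproduced directly with JavaScript's built-in String.prototype.normalize; the variable names here are illustrative:

```javascript
// "café" typed two ways: NFC (precomposed é) vs NFD (e + combining accent)
const stored = "caf\u00E9";   // NFC: U+00E9 LATIN SMALL LETTER E WITH ACUTE
const query  = "cafe\u0301";  // NFD: "e" + U+0301 COMBINING ACUTE ACCENT

console.log(stored === query);                                   // false
console.log(stored.normalize("NFC") === query.normalize("NFC")); // true
```

Normalizing both sides to the same form before comparison (or at index and query time) makes the match succeed.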

NFKC for Search Indexes

For search, NFKC is typically the best choice because it:

  1. Canonically composes characters (like NFC)
  2. Decomposes compatibility characters, treating visually similar characters as equivalent:
    • Fullwidth Ａ → ASCII A
    • Ligature ﬁ → fi
    • Superscript ² → 2
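Each of these foldings can be checked with String.prototype.normalize:

```javascript
// NFKC folds compatibility characters into their plain equivalents
console.log("\uFF21".normalize("NFKC")); // fullwidth Ａ → "A"
console.log("\uFB01".normalize("NFKC")); // ligature ﬁ → "fi"
console.log("\u00B2".normalize("NFKC")); // superscript ² → "2"
```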

Search Pipeline Best Practice

Input text
  → NFKC normalize
  → Case fold (toLowerCase)
  → Remove accents (optional, via NFD + strip combining marks)
  → Tokenize
  → Index
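The pipeline above can be sketched in a few lines of JavaScript. This is a minimal illustration, not a production analyzer: the function name normalizeForIndex is hypothetical, and tokenization is a naive whitespace split.

```javascript
// Minimal sketch of the search-indexing pipeline (hypothetical helper)
function normalizeForIndex(text) {
  return text
    .normalize("NFKC")               // compatibility + canonical composition
    .toLowerCase()                   // case fold (simple form)
    .normalize("NFD")                // decompose so accents become combining marks
    .replace(/[\u0300-\u036f]/g, "") // strip combining marks (optional accent removal)
    .split(/\s+/)                    // naive whitespace tokenization
    .filter(Boolean);
}

console.log(normalizeForIndex("Ｃafé NAÏVE ﬁles"));
// → ["cafe", "naive", "files"]
```

Note that toLowerCase is an approximation of full Unicode case folding, which is sufficient for most Latin-script text.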

Accent-Insensitive Search

For accent-insensitive search, normalize to NFD and then strip combining marks (U+0300–U+036F):

function removeAccents(str) {
  // Decompose accented letters, then strip the combining marks
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}
removeAccents("café")  // "cafe"
removeAccents("naïve") // "naive"

Database-Level Normalization

PostgreSQL supports ICU collations with normalization. MySQL and SQLite have no built-in equivalent, so applications using them should normalize text before insertion.

Use Case

Used by search engines (Elasticsearch, Solr, MeiliSearch), database full-text search systems, and any application that needs to match user queries against stored text. Particularly important for multilingual applications serving users who type in different keyboard layouts and input methods.
