Unicode Normalization for Search and Indexing
Learn how to apply Unicode normalization to improve search accuracy. Understand why NFKC is preferred for search indexes and how to handle accented characters in queries.
Detailed Explanation
Normalization for Search
Search engines and text indexing systems must handle the reality that users type text in many different ways. Unicode normalization is a critical preprocessing step for accurate search.
The Problem Without Normalization
Consider a database containing the name "café" stored as café (NFC). A user searches for "café" but their system sends café (NFD). Without normalization:
"café" (stored) ≠ "café" (query)
The search fails even though the text is visually identical.
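The mismatch can be reproduced directly in JavaScript, where `String.prototype.normalize` resolves it, a minimal sketch:

```javascript
// NFC: "é" stored as a single precomposed code point U+00E9.
const stored = "caf\u00E9";
// NFD: "e" followed by the combining acute accent U+0301.
const query = "cafe\u0301";

console.log(stored === query);                                   // false
console.log(stored.normalize("NFC") === query.normalize("NFC")); // true
```

Normalizing both sides to the same form (NFC here) before comparison is what makes the two visually identical strings compare equal.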
NFKC for Search Indexes
For search, NFKC is typically the best choice because it:
- Canonically composes characters (like NFC)
- Decomposes compatibility characters, treating visually similar characters as equivalent:
  - Fullwidth Ａ → ASCII A
  - Ligature ﬁ → fi
  - Superscript ² → 2
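These compatibility mappings can be checked directly with `normalize("NFKC")`:

```javascript
// Compatibility characters collapse to plain ASCII equivalents under NFKC.
"\uFF21".normalize("NFKC"); // fullwidth Ａ (U+FF21) → "A"
"\uFB01".normalize("NFKC"); // ligature ﬁ (U+FB01) → "fi"
"\u00B2".normalize("NFKC"); // superscript ² (U+00B2) → "2"
```

Note that NFKC is lossy by design: after normalization you can no longer tell that the user typed ² rather than 2, which is exactly the equivalence a search index wants.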
Search Pipeline Best Practice
Input text
→ NFKC normalize
→ Case fold (toLowerCase, or full Unicode case folding where available)
→ Remove accents (optional, via NFD + strip combining marks)
→ Tokenize
→ Index
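The pipeline above can be sketched in a few lines of JavaScript. This is a minimal illustration, not a production indexer: whitespace splitting stands in for real tokenization, and the accent-stripping step is shown as always on.

```javascript
// Sketch of the search pipeline: NFKC → case fold → strip accents → tokenize.
// Whitespace tokenization is a simplifying assumption; real indexers use
// language-aware tokenizers.
function indexTerms(text) {
  return text
    .normalize("NFKC")               // canonical composition + compatibility mappings
    .toLowerCase()                   // case fold
    .normalize("NFD")                // decompose so accents become combining marks
    .replace(/[\u0300-\u036f]/g, "") // strip combining marks (optional step)
    .split(/\s+/)                    // naive tokenization
    .filter(Boolean);                // drop empty tokens
}

indexTerms("Café au lait"); // ["cafe", "au", "lait"]
```

Running the same function over both the stored documents and incoming queries guarantees that equivalent inputs produce identical index terms.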
Accent-Insensitive Search
For accent-insensitive search, normalize to NFD and then strip combining marks (U+0300–U+036F):
function removeAccents(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

removeAccents("café") // "cafe"
removeAccents("naïve") // "naive"
Database-Level Normalization
PostgreSQL supports ICU collations with normalization. MySQL and SQLite can normalize via application-level preprocessing before insertion.
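For the application-level approach, one common pattern is to store the original text alongside a normalized search key and query against the key. A minimal sketch, where `db`, the table, and the column names are hypothetical placeholders for your own client and schema:

```javascript
// Build a normalized search key; applied identically at insert time and
// query time so stored text and queries always compare in the same form.
function toSearchKey(text) {
  return text.normalize("NFKC").toLowerCase();
}

// Hypothetical insert: persist both the original name and its search key.
async function insertUser(db, name) {
  await db.query(
    "INSERT INTO users (name, name_search) VALUES ($1, $2)",
    [name, toSearchKey(name)]
  );
}
```

Queries then normalize the user's input with the same `toSearchKey` function and match it against the `name_search` column, so MySQL and SQLite never need to understand Unicode normalization themselves.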
Use Case
Used by search engines (Elasticsearch, Solr, MeiliSearch), database full-text search systems, and any application that needs to match user queries against stored text. Particularly important for multilingual applications serving users who type in different keyboard layouts and input methods.