Unicode Normalization Best Practices
A comprehensive guide to Unicode normalization best practices for web developers, backend engineers, and system architects. Learn when, where, and how to normalize.
Detailed Explanation
Rule 1: Always Normalize Early
Normalize text at the point of entry into your system:
- Form submission handlers
- API request parsing
- File content reading
- Database query parameters
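// Express.js middleware example.
// Assumes a text body parser such as express.text() has already run, so req.body is a string.
// Bodies parsed as JSON are objects, and each string field needs normalizing instead.
app.use((req, res, next) => {
  if (typeof req.body === 'string') {
    req.body = req.body.normalize('NFC');
  }
  next();
});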
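Request bodies parsed as JSON arrive as objects rather than strings, so a recursive helper can cover that case. The following is a minimal sketch, assuming plain JSON-like data (strings, numbers, booleans, null, arrays, and plain objects); `normalizeDeep` is a hypothetical name, not part of Express or any library.

```javascript
// Hypothetical helper: recursively apply a normalization form to every string
// (including object keys) in a JSON-like value.
function normalizeDeep(value, form = 'NFC') {
  if (typeof value === 'string') return value.normalize(form);
  if (Array.isArray(value)) return value.map((item) => normalizeDeep(item, form));
  if (value !== null && typeof value === 'object') {
    // Keys are text too, so they are normalized along with the values.
    return Object.fromEntries(
      Object.entries(value).map(([key, item]) => [key.normalize(form), normalizeDeep(item, form)])
    );
  }
  return value; // numbers, booleans, null, and undefined pass through unchanged
}

const decomposed = { name: 'cafe\u0301', tags: ['re\u0301sume\u0301'], visits: 3 };
const normalized = normalizeDeep(decomposed);
console.log(normalized.name === 'caf\u00E9'); // true
```

Inside a middleware, the same helper could be applied to `req.body` after the JSON parser has run.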
Rule 2: Choose One Form and Be Consistent
- NFC for storage and general use (W3C recommendation)
- NFKC for search indexes and security-sensitive comparison
- Never mix forms within the same field, column, or index
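The difference between the two forms can be seen with a compatibility character. This is a small illustration, assuming a runtime with the standard `String.prototype.normalize`; the variable names are only for the example.

```javascript
// U+FB01 is the single-character "fi" ligature, a compatibility character.
const ligature = '\uFB01nance';

// NFC leaves compatibility characters alone, so stored text keeps its original form.
const forStorage = ligature.normalize('NFC');

// NFKC folds the ligature into plain "fi", which suits search indexes and comparisons.
const forSearch = ligature.normalize('NFKC');

console.log(forStorage === ligature);  // true
console.log(forSearch === 'finance');  // true
console.log(forStorage === forSearch); // false
```

Because the two results differ, comparing a value normalized with NFC against one normalized with NFKC can fail even when both came from the same input.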
Rule 3: Normalize Before Comparison
Always normalize both sides before comparing:
// BAD
if (input === stored) { ... }
// GOOD
if (input.normalize('NFC') === stored.normalize('NFC')) { ... }
// BEST (if stored data is already normalized)
if (input.normalize('NFC') === stored) { ... }
Rule 4: Normalize Before Hashing
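Hash functions operate on bytes, so canonically equivalent strings that use different code point sequences produce different hashes unless they are normalized first:
// sha256 stands for any SHA-256 helper
sha256("caf\u00E9")                   // precomposed é (U+00E9): one hash
sha256("cafe\u0301")                  // e + combining acute (U+0301): different hash!
sha256("caf\u00E9".normalize("NFC"))  // consistent hash
sha256("cafe\u0301".normalize("NFC")) // same hash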
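A runnable version of this comparison can be sketched with Node.js's built-in `crypto` module. It assumes a Node.js runtime and UTF-8 encoding; `sha256Normalized` is a hypothetical helper name.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: hash the NFC form of the text, encoded as UTF-8.
function sha256Normalized(text) {
  return createHash('sha256').update(text.normalize('NFC'), 'utf8').digest('hex');
}

const precomposed = 'caf\u00E9'; // é as a single code point
const decomposed = 'cafe\u0301'; // e followed by a combining acute accent

// Hashing the raw strings: the byte sequences differ, so the digests differ.
const rawA = createHash('sha256').update(precomposed, 'utf8').digest('hex');
const rawB = createHash('sha256').update(decomposed, 'utf8').digest('hex');

console.log(rawA === rawB); // false
console.log(sha256Normalized(precomposed) === sha256Normalized(decomposed)); // true
```

The same reasoning applies to signatures, cache keys, and deduplication keys derived from text.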
Rule 5: Document Your Normalization Policy
Include in your project's technical documentation:
- Which normalization form you use
- Where normalization is applied
- How legacy non-normalized data is handled
Rule 6: Test with Real Unicode Data
Include test cases with:
- Precomposed and decomposed forms of the same text
- Combining characters with multiple marks
- Korean Hangul in both forms
- Fullwidth/halfwidth characters
- Compatibility characters (ligatures, fractions)
Rule 7: Handle Legacy Data
For existing data that may not be normalized:
- Add a migration to normalize existing records
- Add a normalization step to data import pipelines
- Log warnings when non-normalized input is detected
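The detection and migration steps might look like the following sketch. It assumes in-memory records of the shape `{ id, name }`; `isNormalized` and `normalizeRecords` are hypothetical names, and a real migration would read from and write to the actual data store in batches.

```javascript
// Hypothetical helper: true when the text is already in the given form.
function isNormalized(text, form = 'NFC') {
  return text === text.normalize(form);
}

// Hypothetical migration step: returns normalized copies of the records and
// reports which ids needed a change, warning once per affected record.
function normalizeRecords(records, warn = console.warn) {
  const changedIds = [];
  const normalized = records.map((record) => {
    if (isNormalized(record.name)) return record;
    warn(`Non-normalized text in record ${record.id}`);
    changedIds.push(record.id);
    return { ...record, name: record.name.normalize('NFC') };
  });
  return { normalized, changedIds };
}

const legacy = [
  { id: 1, name: 'caf\u00E9' },  // already NFC
  { id: 2, name: 'cafe\u0301' }, // decomposed, needs migration
];
const { normalized, changedIds } = normalizeRecords(legacy);
console.log(changedIds); // [ 2 ]
```

Returning the list of changed ids makes it possible to run the migration in a dry-run mode first and review what would be rewritten.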
Use Case
This guide serves as a practical checklist for engineering teams establishing Unicode handling standards for new projects or auditing existing systems for normalization issues. It is particularly valuable during code reviews and architecture design discussions for internationalized applications.