Unicode Normalization Best Practices

A comprehensive guide to Unicode normalization best practices for web developers, backend engineers, and system architects. Learn when, where, and how to normalize.

Detailed Explanation

Rule 1: Always Normalize Early

Normalize text at the point of entry into your system:

  • Form submission handlers
  • API request parsing
  • File content reading
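  • Database query parameters

// Express.js middleware example; register it after body parsers such as express.json()
const normalizeDeep = (value) => {
  if (typeof value === 'string') return value.normalize('NFC');
  if (Array.isArray(value)) return value.map(normalizeDeep);
  if (value && typeof value === 'object' && !Buffer.isBuffer(value)) {
    return Object.fromEntries(Object.entries(value).map(([k, v]) => [k, normalizeDeep(v)]));
  }
  return value;
};
app.use((req, res, next) => {
  req.body = normalizeDeep(req.body); // strings, arrays, and nested objects
  next();
});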
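
The middleware above covers request bodies. File content is another entry point from the list, and the same idea applies there. The following is a minimal sketch for Node.js; `readTextNormalized` and the temporary demo file are hypothetical names used only for illustration, and the sketch assumes the file is UTF-8 encoded.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: read a UTF-8 text file and normalize it on the way in.
function readTextNormalized(filePath, form = 'NFC') {
  return fs.readFileSync(filePath, 'utf8').normalize(form);
}

// Demo: write decomposed text (e + U+0301) to a temporary file, then read it back.
const demoPath = path.join(os.tmpdir(), `nfc-demo-${process.pid}.txt`);
fs.writeFileSync(demoPath, 'cafe\u0301', 'utf8');
const content = readTextNormalized(demoPath);
fs.unlinkSync(demoPath);

console.log(content === 'caf\u00E9'); // true: the text is now precomposed
console.log(content.length);          // 4
```

Wrapping the read in one helper keeps the rest of the code from ever seeing non-normalized file content.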

Rule 2: Choose One Form and Be Consistent

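  • NFC for storage and general use (W3C recommendation)
  • NFKC for search indexes and security-sensitive comparison (it is lossy, so keep the NFC original for display)
  • Never mix forms for the same purpose: every value in a given store, index, or comparison path uses exactly one form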
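
To make the difference between the two forms concrete, here is a small sketch. The helper names `toStorageForm` and `toSearchKey` are hypothetical; only `String.prototype.normalize` is a built-in.

```javascript
// Hypothetical helper: canonical form for anything that is stored or displayed.
function toStorageForm(text) {
  return text.normalize('NFC');
}

// Hypothetical helper: compatibility form for search keys and identifier checks.
// NFKC folds compatibility characters such as ligatures and fullwidth letters.
function toSearchKey(text) {
  return text.normalize('NFKC');
}

// "fi" ligature (U+FB01), then "le ", then fullwidth "A" (U+FF21)
const input = '\uFB01le \uFF21';

console.log(toStorageForm(input) === input); // true: NFC keeps compatibility characters
console.log(toSearchKey(input));             // "file A"
```

Routing every call through one helper per purpose is one way to keep a single form per layer.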

Rule 3: Normalize Before Comparison

Always normalize both sides before comparing:

// BAD
if (input === stored) { ... }

// GOOD
if (input.normalize('NFC') === stored.normalize('NFC')) { ... }

// BEST (if stored data is already normalized)
if (input.normalize('NFC') === stored) { ... }
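
The same rule applies to implicit comparisons such as `Set` and `Map` lookups, which compare code units exactly. A minimal sketch, with `canonicallyEqual` as a hypothetical helper name:

```javascript
// Hypothetical helper: compare two strings by canonical equivalence.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const precomposed = 'caf\u00E9'; // é as a single code point (U+00E9)
const decomposed = 'cafe\u0301'; // e followed by combining acute accent (U+0301)

console.log(precomposed === decomposed);                // false
console.log(canonicallyEqual(precomposed, decomposed)); // true

// Set and Map lookups are comparisons too, so normalize keys on insert and on lookup.
const seen = new Set([precomposed.normalize('NFC')]);
console.log(seen.has(decomposed));                  // false
console.log(seen.has(decomposed.normalize('NFC'))); // true
```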

Rule 4: Normalize Before Hashing

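A hash function sees bytes, not characters, so canonically equivalent strings with different code point sequences produce different hashes:

const { createHash } = require('node:crypto');
const sha256 = (s) => createHash('sha256').update(s, 'utf8').digest('hex');

sha256("caf\u00E9")                     // precomposed é (U+00E9): one hash
sha256("cafe\u0301")                    // e + combining acute (U+0301): different hash!
sha256("caf\u00E9".normalize("NFC"))    // consistent hash
sha256("cafe\u0301".normalize("NFC"))   // same hash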

Rule 5: Document Your Normalization Policy

Include in your project's technical documentation:

  • Which normalization form you use
  • Where normalization is applied
  • How legacy non-normalized data is handled

Rule 6: Test with Real Unicode Data

Include test cases with:

  • Precomposed and decomposed forms of the same text
  • Combining characters with multiple marks
  • Korean Hangul in both forms
  • Fullwidth/halfwidth characters
  • Compatibility characters (ligatures, fractions)
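
The categories above can be turned into a small table of fixtures. This is a sketch, not an exhaustive suite; the `cases` array and `failingCases` function are hypothetical names, and each pair is written with escapes so the two forms cannot be confused in an editor.

```javascript
// Each pair differs as raw text but should be equal after normalizing with `form`.
const cases = [
  { name: 'precomposed vs decomposed', form: 'NFC', a: 'caf\u00E9', b: 'cafe\u0301' },
  { name: 'multiple marks, different order', form: 'NFC', a: 'q\u0307\u0323', b: 'q\u0323\u0307' },
  { name: 'Hangul syllable vs conjoining jamo', form: 'NFC', a: '\uD55C', b: '\u1112\u1161\u11AB' },
  { name: 'fullwidth vs ASCII letters', form: 'NFKC', a: '\uFF21\uFF22\uFF23', b: 'ABC' },
  { name: 'halfwidth vs standard katakana', form: 'NFKC', a: '\uFF76', b: '\u30AB' },
  { name: 'fi ligature', form: 'NFKC', a: '\uFB01', b: 'fi' },
  { name: 'vulgar fraction one half', form: 'NFKC', a: '\u00BD', b: '1\u20442' },
];

// Returns the names of cases whose two sides still differ after normalization.
function failingCases(testCases) {
  return testCases
    .filter(({ a, b, form }) => a.normalize(form) !== b.normalize(form))
    .map(({ name }) => name);
}

console.log(failingCases(cases)); // []
```

The same table can be fed through your own entry points (handlers, importers, comparison helpers) instead of calling `normalize` directly.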

Rule 7: Handle Legacy Data

For existing data that may not be normalized:

  • Add a migration to normalize existing records
  • Add a normalization step to data import pipelines
  • Log warnings when non-normalized input is detected
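
The steps above can be sketched in memory as follows. The record shape `{ id, name }` and the names `isNormalized` and `normalizeRecords` are assumptions for illustration; a real migration would read and write database rows in batches.

```javascript
// Hypothetical helper: true when the text is already in the given form.
function isNormalized(text, form = 'NFC') {
  return text === text.normalize(form);
}

// Hypothetical migration step: normalize the `name` field and warn about each change.
function normalizeRecords(records, form = 'NFC', warn = console.warn) {
  let changed = 0;
  const result = records.map((record) => {
    if (isNormalized(record.name, form)) return record; // untouched records pass through
    warn(`Record ${record.id}: name is not ${form}-normalized`);
    changed += 1;
    return { ...record, name: record.name.normalize(form) };
  });
  return { records: result, changed };
}

const legacy = [
  { id: 1, name: 'caf\u00E9' },  // already NFC
  { id: 2, name: 'cafe\u0301' }, // decomposed legacy value
];

const migrated = normalizeRecords(legacy);
console.log(migrated.changed); // 1
```

The `isNormalized` check can also run in import pipelines to log a warning before the value is normalized and stored.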

Use Case

A practical checklist for engineering teams establishing Unicode handling standards for new projects, or auditing existing systems for normalization issues. Particularly valuable during code reviews and architecture design discussions for internationalized applications.
