Unicode Normalization and Security — Confusable Characters

Understand how Unicode normalization relates to security: confusable characters (homoglyphs), spoofing attacks, and how NFKC helps prevent username and URL spoofing.


Detailed Explanation

Unicode Normalization and Security

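Unicode normalization matters for security because the same visible text can be encoded in more than one way. If identifiers are compared without normalizing them first, attackers can use alternately encoded or visually similar characters to bypass uniqueness and filtering checks.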

Confusable Characters (Homoglyphs)

Some characters from different scripts look identical or nearly identical:

Character   Code Point   Script
A           U+0041       Latin
А           U+0410       Cyrillic
Α           U+0391       Greek

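These characters are not equivalent under any normalization form: they are distinct letters with no decomposition mappings between them, so NFC, NFD, NFKC, and NFKD all leave them unchanged. Normalization only unifies different encodings of the same abstract character (canonical and compatibility equivalences).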
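As a quick check of the point about the three look-alike capitals, the following sketch uses Python's standard `unicodedata` module to run each of them through all four normalization forms. The variable name `lookalikes` is only illustrative.

```python
import unicodedata

# Latin A, Cyrillic A, Greek Alpha: visually alike, distinct code points.
lookalikes = ["\u0041", "\u0410", "\u0391"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = [unicodedata.normalize(form, ch) for ch in lookalikes]
    # No form changes any of the three, so they never compare equal.
    assert normalized == lookalikes
    assert len(set(normalized)) == 3

# Print the code point and official character name of each look-alike.
for ch in lookalikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Because normalization leaves all three untouched, a uniqueness check based on normalization alone would treat them as three different strings.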

What Normalization Catches

NFKC normalization helps with:

  • Fullwidth attacks: ａｄｍｉｎ → admin (NFKC maps fullwidth forms to ASCII)
  • Ligature spoofing: ﬁle → file (NFKC splits ligatures)
  • Singleton equivalences: Ω (U+2126 Ohm sign) → Ω (U+03A9 Greek Omega); this mapping is canonical, so NFC handles it as well
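The three cases in this list can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch; the helper names `nfkc` and `nfc` and the sample strings are illustrative assumptions, not part of any library.

```python
import unicodedata

def nfkc(text):
    return unicodedata.normalize("NFKC", text)

def nfc(text):
    return unicodedata.normalize("NFC", text)

fullwidth_admin = "\uff41\uff44\uff4d\uff49\uff4e"  # fullwidth letters a, d, m, i, n
ligature_file = "\ufb01le"                           # LATIN SMALL LIGATURE FI + "le"
ohm_sign = "\u2126"                                  # OHM SIGN

# NFKC folds all three into their plain counterparts.
assert nfkc(fullwidth_admin) == "admin"
assert nfkc(ligature_file) == "file"
assert nfkc(ohm_sign) == "\u03a9"  # GREEK CAPITAL LETTER OMEGA

# NFC leaves the fullwidth and ligature forms alone (they are
# compatibility mappings), but still maps the Ohm sign to Omega.
assert nfc(fullwidth_admin) == fullwidth_admin
assert nfc(ligature_file) == ligature_file
assert nfc(ohm_sign) == "\u03a9"
```

The contrast with NFC is the reason identifier comparison typically relies on the compatibility forms: NFC alone would let the fullwidth and ligature variants through as distinct strings.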

What Normalization Does NOT Catch

  • Cross-script homoglyphs (Latin 'a' vs Cyrillic 'а')
  • Look-alike substitutions (rn vs m)
  • Zero-width characters (U+200B, U+200C, U+200D, U+FEFF)

For these, you need additional defenses like:

  • Script-mixing detection
  • Confusable character tables (Unicode TR39)
  • Zero-width character stripping
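The sketch below illustrates these gaps and rough versions of the three defenses. It makes several simplifying assumptions: the script of a letter is guessed from the first word of its Unicode character name (the Python standard library does not expose the Script property), and `SAMPLE_CONFUSABLES` is a tiny hand-written stand-in for the confusables data published with Unicode TR39, not the real table. All function names are hypothetical.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

# Hypothetical sample mapping Cyrillic look-alikes to Latin letters.
# A real system would load the TR39 confusables data instead.
SAMPLE_CONFUSABLES = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c",
}

def strip_zero_width(text):
    """Remove the zero-width characters listed above."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def scripts_used(text):
    """Heuristic: first word of each letter's Unicode name, e.g. 'LATIN'."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

def skeleton(text):
    """Simplified skeleton: replace each sample confusable with its prototype."""
    return "".join(SAMPLE_CONFUSABLES.get(ch, ch) for ch in text)

spoofed = "p\u0430ypal"    # second letter is Cyrillic, not Latin
hidden = "ad\u200bmin"     # zero-width space inside "admin"

# NFKC changes neither string, so both slip past normalization alone.
assert unicodedata.normalize("NFKC", spoofed) == spoofed
assert unicodedata.normalize("NFKC", hidden) == hidden

print(scripts_used(spoofed))     # two scripts: LATIN and CYRILLIC
print(skeleton(spoofed))         # paypal
print(strip_zero_width(hidden))  # admin
```

Look-alike sequences such as "rn" versus "m" are not covered by this single-character sample table; handling them needs multi-character mappings of the kind the TR39 data provides.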

Username Security Pipeline

User input
  → Strip zero-width characters
  → NFKC normalize
  → Case fold
  → Check against confusable table
  → Reject if mixed scripts
  → Store
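The pipeline can be sketched in Python as follows, keeping the same order of steps. This is an illustration under stated assumptions rather than a production implementation: `prepare_username` is a hypothetical helper, `SAMPLE_CONFUSABLES` is a small hand-written stand-in for the TR39 confusables data, scripts are guessed from Unicode character names, and "Store" is modeled as adding to an in-memory set.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

# Hypothetical sample; stands in for the TR39 confusables data.
SAMPLE_CONFUSABLES = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c",
}

def prepare_username(raw, taken_skeletons):
    """Return (stored_form, skeleton) or raise ValueError."""
    # 1. Strip zero-width characters.
    text = "".join(ch for ch in raw if ch not in ZERO_WIDTH)
    # 2. NFKC normalize.
    text = unicodedata.normalize("NFKC", text)
    # 3. Case fold; folding can denormalize, so normalize once more.
    text = unicodedata.normalize("NFKC", text.casefold())
    # 4. Check against the (sample) confusable table.
    skeleton = "".join(SAMPLE_CONFUSABLES.get(ch, ch) for ch in text)
    if skeleton in taken_skeletons:
        raise ValueError("confusable with an existing username")
    # 5. Reject if mixed scripts (name-based heuristic).
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }
    if len(scripts) > 1:
        raise ValueError("mixed scripts")
    return text, skeleton

# 6. Store: here, just remember the skeleton of each accepted name.
taken = set()
stored, skel = prepare_username("\uff21dmin\u200b", taken)  # fullwidth A, trailing ZWSP
taken.add(skel)
print(stored)  # admin
```

With "admin" registered, a later attempt such as `prepare_username("\u0430dmin", taken)` (Cyrillic first letter) raises `ValueError` at the confusable check, and a fresh mixed-script name such as `"b\u043eb"` is rejected at the script check.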

PRECIS Framework (RFC 8264)

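The IETF PRECIS framework (RFC 8264) underlies username and password handling in modern protocols. It leaves the choice of normalization form to each profile: the username and password profiles in RFC 8265 specify NFC (the username profiles map fullwidth and halfwidth characters in a separate step), while the Nickname profile in RFC 8266 specifies NFKC.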

Use Case

Essential for authentication systems, username registration, email address validation, URL filtering, and any security-sensitive text comparison. Without normalization, attackers can register usernames or domains that look identical to legitimate ones but bypass uniqueness checks.
