Unicode Normalization and Security — Confusable Characters
Understand how Unicode normalization relates to security: confusable characters (homoglyphs), spoofing attacks, and how NFKC helps prevent username and URL spoofing.
Detailed Explanation
Unicode Normalization and Security
Unicode normalization plays a critical role in security. Without it, an attacker can submit a string that renders the same as a legitimate value but compares unequal to it, which lets the string slip past uniqueness checks, blocklists, and filters.
Confusable Characters (Homoglyphs)
Some characters from different scripts look identical or nearly identical:
| Character | Code Point | Script |
|---|---|---|
| A | U+0041 | Latin |
| А | U+0410 | Cyrillic |
| Α | U+0391 | Greek |
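These characters are not equivalent under any normalization form (NFC, NFD, NFKC, or NFKD): they are distinct characters in different scripts, and none of them has a decomposition mapping to another. Normalization only unifies canonical and compatibility equivalents, such as precomposed versus decomposed accents, or fullwidth versus ASCII letters.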
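As a quick check of the claim above, the following sketch applies all four normalization forms to the three capitals from the table using Python's standard `unicodedata` module. The helper name `normalized_forms` is made up for this illustration.

```python
import unicodedata

# The three look-alike capitals from the table, written as escapes so the
# code points are unambiguous.
lookalikes = {"Latin": "\u0041", "Cyrillic": "\u0410", "Greek": "\u0391"}

def normalized_forms(ch):
    """Return the set of results of applying NFC, NFD, NFKC and NFKD to ch."""
    return {unicodedata.normalize(form, ch) for form in ("NFC", "NFD", "NFKC", "NFKD")}

for script, ch in lookalikes.items():
    # A character that no form changes yields a one-element set containing itself.
    unchanged = normalized_forms(ch) == {ch}
    print(f"{script:9} U+{ord(ch):04X} unchanged by all forms: {unchanged}")
```

Because each character survives every form untouched, the three strings still compare unequal after normalization, even though they render alike.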
What Normalization Catches
NFKC normalization helps with:
- Fullwidth attacks: ａｄｍｉｎ → admin (NFKC maps fullwidth forms to ASCII)
- Ligature spoofing: ﬁle → file (NFKC splits ligatures)
- Duplicate encodings: Ω (U+2126 OHM SIGN) → Ω (U+03A9 Greek capital omega); this mapping is canonical, so NFC applies it as well
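The three mappings above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch: the `samples` dictionary and the `nfkc` helper are names chosen for the illustration, and the inputs are written as escapes so that the fullwidth letters, the ligature, and the Ohm sign cannot be confused with their plain counterparts.

```python
import unicodedata

samples = {
    "fullwidth": "\uff41\uff44\uff4d\uff49\uff4e",  # fullwidth letters a, d, m, i, n
    "ligature": "\ufb01le",                          # U+FB01 LATIN SMALL LIGATURE FI + "le"
    "ohm sign": "\u2126",                            # U+2126 OHM SIGN
}

def nfkc(text):
    """Apply NFKC (compatibility decomposition, then canonical composition)."""
    return unicodedata.normalize("NFKC", text)

for label, raw in samples.items():
    # ascii() shows escapes, which makes the before/after difference visible.
    print(f"{label:10} {ascii(raw)} -> {ascii(nfkc(raw))}")
```

NFC alone leaves the fullwidth letters and the ligature unchanged, which is why the compatibility form is the one of interest for identifiers.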
What Normalization Does NOT Catch
- Cross-script homoglyphs (Latin 'a' vs Cyrillic 'а')
- Look-alike substitutions (rn vs m)
- Zero-width characters (U+200B, U+200C, U+200D, U+FEFF)
For these, you need additional defenses like:
- Script-mixing detection
- Confusable character tables (Unicode TR39)
- Zero-width character stripping
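Two of these defenses can be sketched with the standard library alone. The code below is an approximation, not a TR39 implementation: Python's `unicodedata` does not expose the Unicode Script property, so `scripts_used` guesses the script from the first word of each letter's character name. The function names and that heuristic are assumptions made for this example; a production system would use real Script data and would need to allow legitimate combinations such as the scripts used together in Japanese.

```python
import unicodedata

# The zero-width code points listed above.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def strip_zero_width(text):
    """Remove the zero-width characters that NFKC leaves in place."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def scripts_used(text):
    """Heuristic: take the first word of each letter's Unicode name as its script."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def is_mixed_script(text):
    """True when the letters in text appear to come from more than one script."""
    return len(scripts_used(text)) > 1

# "paypal" with a Cyrillic U+0430 in place of the first Latin "a".
spoof = "p\u0430ypal"
print(scripts_used(spoof), is_mixed_script(spoof))
print(strip_zero_width("ad\u200bmin") == "admin")
```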
Username Security Pipeline
User input
→ Strip zero-width characters
→ NFKC normalize
→ Case fold
→ Check against confusable table
→ Reject if mixed scripts
→ Store
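The pipeline above could be sketched roughly as follows. Everything named here is hypothetical: `CONFUSABLES` is a five-entry stand-in for the TR39 confusables data, `scripts_used` guesses scripts from character names because the standard library has no Script property, and `register` stores names in a plain set rather than a database.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

# Hypothetical, tiny stand-in for the TR39 confusables table:
# Cyrillic letters mapped to the Latin letters they resemble.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}

def canonicalize(raw):
    """Strip zero-width characters, NFKC normalize, then case fold."""
    text = "".join(ch for ch in raw if ch not in ZERO_WIDTH)
    text = unicodedata.normalize("NFKC", text)
    return text.casefold()

def skeleton(text):
    """Replace each known confusable with its look-alike to get a comparison key."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

def scripts_used(text):
    """Heuristic script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split(" ")[0] for ch in text if ch.isalpha()}

def register(raw, taken_skeletons):
    """Run the pipeline; return the stored name or raise ValueError."""
    name = canonicalize(raw)
    key = skeleton(name)
    if key in taken_skeletons:
        raise ValueError("username is confusable with an existing one")
    if len(scripts_used(name)) > 1:
        raise ValueError("username mixes scripts")
    taken_skeletons.add(key)  # "Store": remember the comparison key
    return name
```

With an empty `taken_skeletons` set, registering fullwidth "ＡＤＭＩＮ" stores `admin`; a later attempt with `"ad\u200bmin"` or with a Cyrillic `"\u0430"` in place of the first letter is then rejected because it collapses to the same key.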
PRECIS Framework (RFC 8264)
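The IETF PRECIS framework (RFC 8264) is the basis for username and password handling in modern protocols. Each PRECIS profile specifies its own normalization rule: the username and password profiles in RFC 8265 normalize to NFC, with usernames getting a separate width-mapping step for fullwidth and halfwidth characters, while the nickname profile in RFC 8266 uses NFKC.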
Use Case
Essential for authentication systems, username registration, email address validation, URL filtering, and any security-sensitive text comparison. Without normalization, attackers can register usernames or domains that look identical to legitimate ones but bypass uniqueness checks.