Unicode Normalization for URL Comparison
Understand how Unicode normalization affects URLs and Internationalized Domain Names (IDN). Learn why normalizing before comparison prevents duplicate detection failures.
Detailed Explanation
Normalization in URLs
URLs can contain Unicode characters in two ways: directly in Internationalized Domain Names (IDN) and percent-encoded in the path and query components. Normalization is critical for comparing and deduplicating URLs.
Internationalized Domain Names (IDN)
Domain names like café.com are converted to Punycode (xn--caf-dma.com) for DNS. IDNA2008 expects labels to be in NFC before the Punycode conversion, and UTS #46 compatibility processing applies NFC as part of its mapping step:
café.com (NFC: U+00E9) → xn--caf-dma.com
café.com (NFD: U+0065+U+0301) → xn--cafe-yvc.com (different Punycode)
If input is not NFC-normalized first, different Unicode representations of the same visual domain produce different Punycode strings.
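As a minimal sketch of the difference, the snippet below compares the two spellings of café as JavaScript strings. The variable names and the toNfcLabel helper are illustrative assumptions, not part of any IDN library; only the built-in String.prototype.normalize is used.

```javascript
// Two spellings of the same visible label "café".
const composed = "caf\u00e9";    // NFC: é as the single code point U+00E9
const decomposed = "cafe\u0301"; // NFD: e followed by combining acute U+0301

// Hypothetical helper: normalize a label to NFC before any Punycode conversion.
function toNfcLabel(label) {
  return label.normalize("NFC");
}

// The raw strings differ in length and content.
const sameBefore = composed === decomposed;

// After NFC normalization both spellings are the same string.
const sameAfter = toNfcLabel(composed) === toNfcLabel(decomposed);

console.log(composed.length, decomposed.length, sameBefore, sameAfter);
```

A Punycode encoder that receives the two raw strings sees different code point sequences, which is why the normalization has to happen first.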
Percent-Encoded Paths
URL paths can contain percent-encoded Unicode:
/caf%C3%A9 (NFC: é as UTF-8 bytes C3 A9)
/cafe%CC%81 (NFD: e + combining acute as UTF-8 bytes 65 CC 81)
These are byte-for-byte different URLs, and a server may treat them as different resources, even though the decoded paths render identically.
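The two paths above can be checked directly. This is a small sketch using only built-in functions; the variable names are chosen here for illustration.

```javascript
const nfcPath = "/caf%C3%A9";  // é encoded as UTF-8 bytes C3 A9
const nfdPath = "/cafe%CC%81"; // e, then combining acute as UTF-8 bytes CC 81

// Decoding alone does not make the paths equal: the code points still differ.
const decodedNfc = decodeURIComponent(nfcPath);
const decodedNfd = decodeURIComponent(nfdPath);
const equalRaw = nfcPath === nfdPath;
const equalDecoded = decodedNfc === decodedNfd;

// Only after NFC normalization do the decoded paths compare equal.
const equalNormalized =
  decodedNfc.normalize("NFC") === decodedNfd.normalize("NFC");

console.log(equalRaw, equalDecoded, equalNormalized);
```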
URL Comparison Best Practice
To reliably compare URLs:
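- Decode percent-encoded sequences, leaving escaped reserved characters such as %2F intact
- Normalize the decoded text to NFC
- Re-encode non-ASCII characters as UTF-8 percent-escapes for comparison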
- Lowercase the scheme and host (per RFC 3986)
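function normalizeUrl(url) {
  // Decode only escaped non-ASCII bytes (%80-%FF). Decoding the whole URL with
  // decodeURIComponent would also turn %2F or %3F into live delimiters.
  const decoded = url.replace(/(?:%[89a-f][0-9a-f])+/gi, (run) => {
    try { return decodeURIComponent(run); } catch { return run; } // keep invalid UTF-8 as-is
  });
  const normalized = decoded.normalize("NFC");
  // Re-encode only non-ASCII characters; encodeURI would double-encode "%".
  return normalized.replace(/[^\x00-\x7f]+/g, (run) => encodeURIComponent(run));
}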
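The function above does not cover the scheme and host step. One possible sketch, assuming an environment with the WHATWG URL class (browsers, Node.js), lets the URL parser lowercase both parts and convert IDN hosts to Punycode. The normalizedOrigin helper and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: return the scheme and host in normalized form.
// The WHATWG URL parser lowercases both and converts IDN hosts to Punycode.
function normalizedOrigin(input) {
  const parsed = new URL(input); // throws TypeError if the input cannot be parsed
  return `${parsed.protocol}//${parsed.host}`;
}

// Case differences in the scheme and host disappear.
const upperOrigin = normalizedOrigin("HTTP://Example.COM/caf%C3%A9");
const lowerOrigin = normalizedOrigin("http://example.com/cafe%CC%81");

// The parser's IDN processing includes NFC, so both spellings of the
// host are expected to yield the same Punycode hostname.
const hostFromNfc = new URL("http://caf\u00e9.com/").hostname;
const hostFromNfd = new URL("http://cafe\u0301.com/").hostname;

console.log(upperOrigin === lowerOrigin, hostFromNfc === hostFromNfd);
```

Note that the parser does not normalize the path: the two sample URLs still differ after parsing until the NFC step is applied to the path and query.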
Security Implications
Attackers can use different normalization forms to create URLs that look identical but point to different resources, enabling phishing attacks and cache poisoning.
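One defensive option, sketched here under the assumption that non-NFC input is worth flagging rather than silently accepting, is to test whether the decoded URL changes under NFC. Both function names are hypothetical, and the policy of treating malformed escapes as suspicious is a choice made for this example.

```javascript
// Hypothetical check: a string is in NFC if normalizing it changes nothing.
function isNfc(text) {
  return text === text.normalize("NFC");
}

// Hypothetical helper: flag a user-supplied URL whose decoded form is not NFC.
function hasNonNfcText(url) {
  let decoded;
  try {
    decoded = decodeURIComponent(url); // for inspection only, never reused as a URL
  } catch {
    return true; // malformed percent-encoding is treated as suspicious here
  }
  return !isNfc(decoded);
}

console.log(hasNonNfcText("http://example.com/cafe%CC%81")); // decomposed é
console.log(hasNonNfcText("http://example.com/caf%C3%A9"));  // composed é
```

A security tool could log or reject flagged URLs, while a cache layer would more likely normalize them into a single key.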
Use Case
URL normalization matters for web crawlers, URL deduplication systems, CDN cache key generation, and security tools that need to detect equivalent URLs. It is also critical for internationalized web applications that handle user-provided URLs.