Unicode Normalization for URL Comparison

Understand how Unicode normalization affects URLs and Internationalized Domain Names (IDN). Learn why normalizing before comparison prevents duplicate detection failures.

Detailed Explanation

Normalization in URLs

URLs can contain Unicode characters in two ways: directly in Internationalized Domain Names (IDN) and percent-encoded in the path and query components. Normalization is critical for comparing and deduplicating URLs.

Internationalized Domain Names (IDN)

Domain names like café.com are converted to Punycode (xn--caf-dma.com) for DNS. The IDN standard (IDNA2008) requires NFC normalization before the Punycode conversion:

café.com (NFC: U+00E9)  →  xn--caf-dma.com
café.com (NFD: U+0065+U+0301)  →  Different Punycode!

If input is not NFC-normalized first, different Unicode representations of the same visual domain produce different Punycode strings.
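The difference between the two spellings can be checked directly. The sketch below assumes a JavaScript runtime whose URL parser follows the WHATWG URL Standard (current browsers and recent Node.js), which normalizes the host itself before the Punycode conversion; hostFor is a hypothetical helper name, not part of any library.

```javascript
// Two spellings of the same visible domain.
const nfcDomain = "caf\u00E9.com";  // é as one code point (U+00E9)
const nfdDomain = "cafe\u0301.com"; // e + combining acute (U+0065 U+0301)

// Compared as raw strings, they are not equal.
const rawEqual = nfcDomain === nfdDomain;

// After NFC normalization, they are.
const nfcEqual = nfcDomain.normalize("NFC") === nfdDomain.normalize("NFC");

// Hypothetical helper: let the URL parser produce the ASCII (Punycode) host.
// A WHATWG-conformant parser normalizes the host as part of this conversion.
function hostFor(domain) {
  return new URL("https://" + domain + "/").hostname;
}

console.log(rawEqual, nfcEqual, hostFor(nfcDomain));
```

Code that builds Punycode by hand, without normalizing first, does not get this protection and will produce two different encoded labels.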

Percent-Encoded Paths

URL paths can contain percent-encoded Unicode:

/caf%C3%A9    (NFC: é as UTF-8 bytes C3 A9)
/cafe%CC%81   (NFD: e + combining acute as UTF-8 bytes 65 CC 81)

Byte for byte, these are different URLs, and a server may map them to different resources, even though they render identically once decoded.
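A minimal sketch of the two encodings, using only built-in functions. samePathAfterNfc is a hypothetical helper, and it decodes with decodeURIComponent only because these sample paths contain no escaped reserved characters.

```javascript
// Percent-encode the same visible word in both normalization forms.
const nfcPath = "/" + encodeURIComponent("caf\u00E9");  // "/caf%C3%A9"
const nfdPath = "/" + encodeURIComponent("cafe\u0301"); // "/cafe%CC%81"

// Hypothetical helper: compare two simple paths after decoding and NFC normalization.
function samePathAfterNfc(a, b) {
  return decodeURIComponent(a).normalize("NFC") ===
         decodeURIComponent(b).normalize("NFC");
}

console.log(nfcPath === nfdPath);                // the encoded paths differ
console.log(samePathAfterNfc(nfcPath, nfdPath)); // the normalized text matches
```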

URL Comparison Best Practice

To reliably compare URLs:

  1. Decode percent-encoded UTF-8 sequences, leaving escaped reserved characters such as %2F encoded
  2. Normalize Unicode to NFC
  3. Re-encode for comparison
  4. Lowercase the scheme and host (per RFC 3986)
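
function normalizeUrl(url) {
  // new URL() lowercases the scheme and host and converts IDN hosts to Punycode.
  const u = new URL(url);
  // Decode only non-ASCII escapes (%2F etc. stay encoded), normalize to NFC, re-encode.
  const nfc = (s) => s.replace(/(?:%[89A-Fa-f][0-9A-Fa-f])+/g, decodeURIComponent)
    .normalize("NFC").replace(/[^\x00-\x7F]+/g, encodeURIComponent);
  u.pathname = nfc(u.pathname);
  u.search = nfc(u.search);
  return u.href;
}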
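
For deduplication, the same steps can be turned into a comparison key. The following is a sketch under stated assumptions, not a complete canonicalizer: canonicalKey and dedupeUrls are hypothetical names, the key ignores the fragment and any credentials, and a component whose escapes are not valid UTF-8 is left untouched rather than rejected.

```javascript
// Hypothetical helper: build a comparison key for a URL string.
function canonicalKey(rawUrl) {
  const u = new URL(rawUrl); // the parser lowercases the scheme and host
  const nfc = (part) => {
    try {
      return part
        // Decode only runs of escaped non-ASCII bytes; %2F and friends stay as-is.
        .replace(/(?:%[89A-Fa-f][0-9A-Fa-f])+/g, (run) => decodeURIComponent(run))
        .normalize("NFC")
        // Re-encode the non-ASCII characters as UTF-8 percent-escapes.
        .replace(/[^\x00-\x7F]+/g, (chars) => encodeURIComponent(chars));
    } catch {
      return part; // bytes were not valid UTF-8; leave the component untouched
    }
  };
  return u.protocol + "//" + u.host + nfc(u.pathname) + nfc(u.search);
}

// Keep the first URL seen for each canonical key.
function dedupeUrls(urls) {
  const seen = new Map();
  for (const url of urls) {
    const key = canonicalKey(url);
    if (!seen.has(key)) seen.set(key, url);
  }
  return [...seen.values()];
}

const unique = dedupeUrls([
  "HTTPS://Example.com/caf%C3%A9",  // NFC path, mixed-case scheme and host
  "https://example.com/cafe%CC%81", // NFD path, same visible text
  "https://example.com/a%2Fb",      // escaped slash: a single path segment
  "https://example.com/a/b",        // real slash: two path segments
]);
console.log(unique);
```

The two café URLs collapse into one entry, while /a%2Fb and /a/b stay distinct because the escaped slash is never decoded.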

Security Implications

Attackers can use different normalization forms to create URLs that look identical but point to different resources, enabling phishing attacks and cache poisoning.
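
One defensive option is to flag incoming URLs whose path or query text is not already in NFC, before they reach a cache or an allowlist. The sketch below illustrates that idea; hasNonNfcText is a hypothetical name, and treating undecodable escapes as suspicious is an assumption of this example rather than an established rule.

```javascript
// Hypothetical check: does the URL's path or query contain text that is not in NFC?
function hasNonNfcText(rawUrl) {
  const u = new URL(rawUrl);
  let text;
  try {
    // Full decoding is acceptable here because the result is only inspected,
    // never used to rebuild the URL.
    text = decodeURIComponent(u.pathname + u.search);
  } catch {
    return true; // escapes that are not valid UTF-8 are treated as suspicious
  }
  return text !== text.normalize("NFC");
}

console.log(hasNonNfcText("https://example.com/cafe%CC%81")); // NFD path
console.log(hasNonNfcText("https://example.com/caf%C3%A9"));  // NFC path
```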

Use Case

Important for web crawlers, URL deduplication systems, CDN cache key generation, and security tools that need to detect equivalent URLs. Also critical for internationalized web applications handling user-provided URLs.
