How Percent Encoding Works
A comprehensive guide to URL percent encoding (RFC 3986). Learn how characters are converted to %HH format, which characters need encoding, and why.
Character: N/A
Encoded: %HH
Detailed Explanation
Percent encoding (also called URL encoding) is the mechanism defined by RFC 3986 for representing characters in a URI that are not allowed in their raw form. Each byte of an encoded character is replaced by a triplet consisting of a percent sign (%) followed by two hexadecimal digits representing that byte's value. An ASCII character therefore becomes one triplet, and a multi-byte character becomes several.
The encoding process:
- Determine whether the character needs encoding (anything outside the unreserved set does, except a reserved character that is acting as a delimiter)
- Convert the character to its byte representation using UTF-8
- For each byte, output % followed by the byte's value as two uppercase hexadecimal digits
Example: The space character (ASCII 32):
- Decimal: 32
- Hexadecimal: 20
- Percent-encoded: %20
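The steps above can be sketched in Python as follows. This is a minimal illustration rather than a full URI encoder: the helper name percent_encode is hypothetical, and it assumes every character outside the unreserved set should be encoded (that is, reserved characters are treated as data).

```python
# Unreserved set from RFC 3986: letters, digits, and - _ . ~
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.~"
)


def percent_encode(text: str) -> str:
    """Hypothetical helper: encode every character outside the unreserved set."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Step 1: unreserved characters pass through unchanged.
            out.append(ch)
        else:
            # Step 2: convert the character to its UTF-8 bytes.
            for byte in ch.encode("utf-8"):
                # Step 3: emit % plus two uppercase hex digits per byte.
                out.append(f"%{byte:02X}")
    return "".join(out)


print(percent_encode(" "))            # %20
print(percent_encode("hello world"))  # hello%20world
```

In real code, a library function such as Python's urllib.parse.quote is usually the better choice; the sketch only makes the mechanism visible.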
Character sets defined by RFC 3986:
Unreserved characters (never need encoding):
A-Z a-z 0-9 - _ . ~
Reserved characters (have special meaning, encode when used as data):
: / ? # [ ] @ ! $ & ' ( ) * + , ; =
All other characters (must always be encoded):
Spaces, control characters, non-ASCII characters, and characters such as { } | \ ^ ` < >
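The three groups can be expressed as a small lookup. The sketch below is illustrative only: the function name classify and the label strings are assumptions, while the character sets are copied from the lists above.

```python
import string

# Character sets as listed above (RFC 3986).
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")
RESERVED = set(":/?#[]@!$&'()*+,;=")


def classify(ch: str) -> str:
    """Hypothetical helper: name the RFC 3986 group a single character falls in."""
    if ch in UNRESERVED:
        return "unreserved"  # never needs encoding
    if ch in RESERVED:
        return "reserved"    # encode only when used as data
    return "other"           # must always be encoded


for ch in ["a", "~", "?", "&", " ", "{", "é"]:
    print(repr(ch), classify(ch))
```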
How multi-byte characters work: Characters outside ASCII are first converted to UTF-8 bytes, then each byte is percent-encoded:
é (e-acute) → UTF-8: 0xC3 0xA9 → %C3%A9
✓ (checkmark) → UTF-8: 0xE2 0x9C 0x93 → %E2%9C%93
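These two results can be checked with Python's standard library. urllib.parse.quote encodes text as UTF-8 by default; passing safe="" is a choice made for this example so that no extra characters are exempted from encoding.

```python
from urllib.parse import quote

for ch in ("é", "✓"):
    # Show the UTF-8 bytes first, then the percent-encoded form.
    utf8_hex = " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    print(ch, "->", utf8_hex, "->", quote(ch, safe=""))

# é -> 0xC3 0xA9 -> %C3%A9
# ✓ -> 0xE2 0x9C 0x93 -> %E2%9C%93
```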
Case sensitivity: RFC 3986 treats the hexadecimal digits in a percent-encoded triplet as case-insensitive (%2f and %2F are equivalent), but says that URI producers and normalizers should use uppercase for consistency. Uppercase is also what JavaScript's encoding functions produce.
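A quick way to see this equivalence, here using Python's urllib.parse rather than JavaScript: both spellings decode to the same character, and the encoder emits the uppercase form.

```python
from urllib.parse import quote, unquote

lower = unquote("%2f")
upper = unquote("%2F")

print(lower == upper)       # True
print(upper)                # /
print(quote("/", safe=""))  # %2F
```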
Decoding: Percent decoding reverses the process. The decoder scans for each %, reads the two hex digits that follow, converts them to a byte, and assembles the resulting bytes into characters using UTF-8. Invalid sequences (like %GG or a truncated %2) should be treated as errors.
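A strict decoder along these lines might look like the sketch below. The name percent_decode is hypothetical. Note that Python's own urllib.parse.unquote is lenient and leaves malformed sequences untouched, so this example raises ValueError explicitly to show the error handling described above.

```python
import string


def percent_decode(text: str) -> str:
    """Hypothetical strict decoder: raises ValueError on malformed input."""
    raw = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            digits = text[i + 1 : i + 3]
            # Reject truncated triplets (%2) and non-hex digits (%GG).
            if len(digits) != 2 or any(d not in string.hexdigits for d in digits):
                raise ValueError(f"invalid percent sequence at index {i}")
            raw.append(int(digits, 16))
            i += 3
        else:
            raw.extend(ch.encode("utf-8"))
            i += 1
    # Assemble the bytes into characters; bad UTF-8 raises UnicodeDecodeError,
    # which is a subclass of ValueError.
    return raw.decode("utf-8")


print(percent_decode("caf%C3%A9"))  # café
print(percent_decode("a%20b"))      # a b
```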
Pitfall: A common misconception is that percent encoding is like HTML entity encoding or Base64 encoding. It is not. Percent encoding is URL-specific and only encodes individual bytes. Confusing it with other encoding schemes leads to double-encoding bugs or corrupted data. Always use the URL-specific encoding functions provided by your programming language.
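The double-encoding bug is easy to reproduce. In this sketch, Python's urllib.parse.quote is applied twice to the same value: the second pass encodes the % signs produced by the first, so a single decode no longer returns the original text.

```python
from urllib.parse import quote, unquote

original = "a b"
once = quote(original, safe="")  # encoded correctly
twice = quote(once, safe="")     # encoded again by mistake

print(once)            # a%20b
print(twice)           # a%2520b
print(unquote(twice))  # a%20b  (still encoded after one decode)
```

Encoding each value exactly once, at the point where it is placed into the URL, avoids this.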
Use Case
Understanding the foundational mechanism behind all URL encoding, essential for any developer working with web APIs, form submissions, or link generation.