Unicode Code Points and Binary Encoding
Understand how Unicode code points map to binary through UTF-8 and UTF-16 encoding. Learn multi-byte sequences, surrogate pairs, and the Basic Multilingual Plane (BMP) versus the supplementary planes.
Detailed Explanation
Unicode assigns a unique code point (a number like U+0041 for 'A') to every character in every writing system. The challenge is encoding these code points into binary bytes efficiently. UTF-8 and UTF-16 are the two dominant encoding schemes.
UTF-8 encoding rules:
UTF-8 uses variable-length encoding from 1 to 4 bytes:
| Code Point Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000 to U+007F | 0xxxxxxx | -- | -- | -- |
| U+0080 to U+07FF | 110xxxxx | 10xxxxxx | -- | -- |
| U+0800 to U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | -- |
| U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
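The table above can be sketched as a manual encoder. This is illustrative only (the function name `utf8_encode` is our own); in practice Python's built-in `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point to UTF-8, following the range table."""
    if code_point <= 0x7F:          # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:      # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

print(utf8_encode(0x41).hex())    # 41 (ASCII 'A', 1 byte)
print(utf8_encode(0x20AC).hex())  # e282ac (euro sign, 3 bytes)
```

Each continuation byte carries 6 payload bits (the `x` positions after the `10` prefix), which is why the ranges step up at U+0080, U+0800, and U+10000.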
Example: encoding the euro sign (U+20AC) in UTF-8:
- U+20AC = 0010 0000 1010 1100 in binary (16 bits)
- Falls in the 3-byte range (U+0800 to U+FFFF)
- Split the bits into the template: 1110 0010 | 10 000010 | 10 101100
- Result: E2 82 AC in hex (3 bytes)
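The worked example can be checked against Python's built-in encoder:

```python
# Verify the euro-sign example: U+20AC should encode to E2 82 AC.
euro = "\u20ac"
encoded = euro.encode("utf-8")
print(encoded.hex())   # e282ac
print(len(encoded))    # 3
```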
Why UTF-8 is dominant:
UTF-8 is backward compatible with ASCII — any ASCII character uses exactly one byte with the same value. This means existing ASCII text is automatically valid UTF-8. For English text, UTF-8 is space-efficient (1 byte per character), while still supporting every Unicode character. Over 98% of web pages use UTF-8 encoding.
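A quick sketch of the ASCII compatibility claim: encoding ASCII text as UTF-8 yields byte-for-byte identical output.

```python
# Any ASCII character encodes to the same single byte in UTF-8,
# so existing ASCII text is already valid UTF-8.
text = "Hello"
assert text.encode("utf-8") == text.encode("ascii")
print(text.encode("utf-8"))  # b'Hello'
```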
UTF-16 and surrogate pairs:
UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane (U+0000 to U+FFFF) and 4 bytes (a surrogate pair) for supplementary characters like emoji. JavaScript strings are internally UTF-16, which is why an emoji like U+1F600 has a .length of 2 in JavaScript even though it appears as a single character: .length counts UTF-16 code units, not characters.
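The surrogate-pair math can be sketched directly: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves.

```python
import struct

# Surrogate-pair computation for U+1F600 (the grinning-face emoji).
cp = 0x1F600
v = cp - 0x10000               # 20-bit value to split
high = 0xD800 + (v >> 10)      # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)     # low (trail) surrogate
print(hex(high), hex(low))     # 0xd83d 0xde00

# Cross-check against Python's UTF-16 encoder (little-endian, no BOM):
units = struct.unpack("<2H", "\U0001F600".encode("utf-16-le"))
assert units == (high, low)
```

Those two 16-bit code units, D83D and DE00, are exactly what JavaScript's .length counts, which is why the emoji reports a length of 2.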
Use Case
Internationalization engineers analyze UTF-8 byte sequences to debug character encoding issues that cause garbled text (mojibake) when data crosses system boundaries.
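A minimal sketch of how mojibake arises, assuming a hypothetical system that misreads UTF-8 bytes as Latin-1:

```python
# UTF-8 bytes for the euro sign, decoded with the wrong charset.
data = "\u20ac".encode("utf-8")   # b'\xe2\x82\xac'
garbled = data.decode("latin-1")  # mojibake: each byte becomes its own character
print(garbled)

# Decoding with the correct charset recovers the original text.
print(data.decode("utf-8"))       # €
```

The garbled string has three characters (one per byte) instead of one, which is the telltale sign of a UTF-8/Latin-1 mismatch.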
Related Topics
- Convert ASCII Characters to Binary: ASCII Text → Binary (7/8-bit)
- Convert Hexadecimal to Decimal: Hexadecimal (Base 16) → Decimal (Base 10)
- Convert Binary to Decimal: Binary (Base 2) → Decimal (Base 10)
- How to Read a Hexdump: Raw Binary Data → Hexadecimal + ASCII
- Base64 Encoding of Binary Data: Binary / Raw Bytes → Base64 (64 ASCII characters)