Unicode Code Points and Binary Encoding

Understand how Unicode code points map to binary through UTF-8 and UTF-16 encoding. Learn multi-byte sequences, surrogate pairs, and the Basic Multilingual Plane (BMP) versus the supplementary planes.


Detailed Explanation

Unicode assigns a unique code point (a number like U+0041 for 'A') to every character in every writing system. The challenge is encoding these code points into binary bytes efficiently. UTF-8 and UTF-16 are the two dominant encoding schemes.

UTF-8 encoding rules:

UTF-8 uses variable-length encoding from 1 to 4 bytes:

Code Point Range    | Byte 1   | Byte 2   | Byte 3   | Byte 4
--------------------|----------|----------|----------|---------
U+0000 to U+007F    | 0xxxxxxx | --       | --       | --
U+0080 to U+07FF    | 110xxxxx | 10xxxxxx | --       | --
U+0800 to U+FFFF    | 1110xxxx | 10xxxxxx | 10xxxxxx | --
U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx

Example — encoding the euro sign (U+20AC) in UTF-8:

  1. U+20AC = 0010 0000 1010 1100 in binary (16 bits)
  2. Falls in the 3-byte range (U+0800 to U+FFFF)
  3. Split bits into the template: 1110 0010 | 10 000010 | 10 101100
  4. Result: E2 82 AC in hex (3 bytes)
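The steps above can be sketched as a small function that follows the byte templates in the table. `encodeUtf8` is an illustrative helper written for this example, not a built-in API:

```javascript
// Encode a single code point to an array of UTF-8 bytes,
// following the variable-length templates from the table above.
// encodeUtf8 is an illustrative helper, not a standard API.
function encodeUtf8(cp) {
  if (cp <= 0x7f) return [cp];              // 1 byte:  0xxxxxxx
  if (cp <= 0x7ff) return [
    0b11000000 | (cp >> 6),                 // 110xxxxx
    0b10000000 | (cp & 0x3f),               // 10xxxxxx
  ];
  if (cp <= 0xffff) return [
    0b11100000 | (cp >> 12),                // 1110xxxx
    0b10000000 | ((cp >> 6) & 0x3f),        // 10xxxxxx
    0b10000000 | (cp & 0x3f),               // 10xxxxxx
  ];
  return [
    0b11110000 | (cp >> 18),                // 11110xxx
    0b10000000 | ((cp >> 12) & 0x3f),       // 10xxxxxx
    0b10000000 | ((cp >> 6) & 0x3f),        // 10xxxxxx
    0b10000000 | (cp & 0x3f),               // 10xxxxxx
  ];
}

// The euro sign U+20AC falls in the 3-byte range.
const bytes = encodeUtf8(0x20ac);
console.log(bytes.map(b => b.toString(16).toUpperCase()).join(' ')); // "E2 82 AC"
```

The bit shifts carve the 16-bit code point into the 4 + 6 + 6 payload bits that the three-byte template requires.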

Why UTF-8 is dominant:

UTF-8 is backward compatible with ASCII — any ASCII character uses exactly one byte with the same value. This means existing ASCII text is automatically valid UTF-8. For English text, UTF-8 is space-efficient (1 byte per character), while still supporting every Unicode character. Over 98% of web pages use UTF-8 encoding.
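The ASCII-compatibility claim is easy to check with the standard TextEncoder API (available in browsers and Node.js), which always produces UTF-8:

```javascript
const enc = new TextEncoder(); // TextEncoder always emits UTF-8

// ASCII text: one byte per character, identical to the ASCII byte values.
console.log(Array.from(enc.encode('Hi')));     // [72, 105]

// Non-ASCII characters expand to multi-byte sequences.
console.log(Array.from(enc.encode('\u20AC'))); // [226, 130, 172], i.e. E2 82 AC
```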

UTF-16 and surrogate pairs:

UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane (U+0000 to U+FFFF) and 4 bytes (a surrogate pair) for supplementary characters like emoji. JavaScript strings are sequences of UTF-16 code units, which is why an emoji like U+1F600 has a .length of 2 in JavaScript even though it renders as a single character.
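The surrogate-pair behavior can be observed directly in any JavaScript runtime; a brief sketch:

```javascript
const grin = '\u{1F600}'; // U+1F600, a supplementary-plane emoji

// .length counts UTF-16 code units, not characters.
console.log(grin.length);                      // 2

// The two code units are the high and low surrogates.
console.log(grin.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(grin.charCodeAt(1).toString(16));  // "de00" (low surrogate)

// codePointAt reassembles the pair into the real code point.
console.log(grin.codePointAt(0).toString(16)); // "1f600"

// String iteration is code-point aware, so spreading yields 1 element.
console.log([...grin].length);                 // 1
```

The pair is derived by subtracting 0x10000 from the code point, then adding the high 10 bits to 0xD800 and the low 10 bits to 0xDC00.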

Use Case

Internationalization engineers analyze UTF-8 byte sequences to debug character encoding issues that cause garbled text (mojibake) when data crosses system boundaries.
