UTF-8 Multibyte Hex Encoding
Understand how UTF-8 encodes multibyte characters as hex sequences. Learn the byte patterns for 2-byte, 3-byte, and 4-byte encodings with examples.
Hex
E4 B8 96 E7 95 8C
Text
世界
Detailed Explanation
UTF-8 is a variable-length character encoding that represents Unicode code points using one to four bytes. While standard ASCII characters (U+0000 to U+007F) require only a single byte, characters from other scripts — Chinese, Japanese, Arabic, emoji — require multiple bytes. Understanding the hex representation of these multibyte sequences is critical for debugging encoding issues, parsing binary data, and working with internationalized text.
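The variable-length behavior is easy to observe directly. A quick sketch in Python (any language with explicit byte access works; `bytes.hex` with a separator requires Python 3.8+):

```python
# One character from each byte-length class, encoded to UTF-8 hex.
for ch in ["A", "é", "世", "😀"]:
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {data.hex(' ').upper()} ({len(data)} byte(s))")
# A is 1 byte, é is 2, 世 is 3, 😀 is 4.
```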
UTF-8 byte patterns:
| Code Point Range | Bytes | Binary Pattern | Hex Range |
|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | 00 – 7F |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | C2 80 – DF BF |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0 80 – EF BF BF |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90 80 80 – F4 8F BF BF |
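These boundary values can be spot-checked by encoding the first and last code point of each range (a minimal sketch; Python's encoder follows the same table):

```python
# Encode the boundary code points of each UTF-8 length class.
for cp in [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]:
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {encoded.hex(' ').upper()}")
# U+00007F -> 7F, U+000080 -> C2 80, U+0007FF -> DF BF,
# U+000800 -> E0 A0 80, ..., U+10FFFF -> F4 8F BF BF
```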
Encoding example — "世" (U+4E16):
The code point U+4E16 falls in the 3-byte range (U+0800 – U+FFFF).
- Convert 4E16 to binary: 0100 1110 0001 0110
- Split into the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx:
  - Top 4 bits 0100 → first byte 1110 0100 = E4
  - Next 6 bits 111000 → second byte 10 111000 = B8
  - Last 6 bits 010110 → third byte 10 010110 = 96
- Result: E4 B8 96
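The bit manipulation above translates directly into code. A sketch of the 3-byte case (the function name `encode_3byte` is illustrative, not a standard API):

```python
def encode_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800–U+FFFF using the 3-byte template
    1110xxxx 10xxxxxx 10xxxxxx."""
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0xE0 | (cp >> 12)          # 1110 + top 4 bits
    b2 = 0x80 | ((cp >> 6) & 0x3F)  # 10 + middle 6 bits
    b3 = 0x80 | (cp & 0x3F)         # 10 + low 6 bits
    return bytes([b1, b2, b3])

print(encode_3byte(0x4E16).hex(" ").upper())  # E4 B8 96
```

The same pattern generalizes: the 2-byte case shifts by 6 once, and the 4-byte case shifts by 18, 12, and 6.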
Recognizing UTF-8 in hex dumps:
When viewing binary data in a hex editor, you can identify UTF-8 multibyte sequences by their leading byte patterns:
- Bytes C2–DF begin a 2-byte sequence (Latin extended, Greek, Cyrillic, etc.)
- Bytes E0–EF begin a 3-byte sequence (CJK characters, most symbols)
- Bytes F0–F4 begin a 4-byte sequence (emoji, rare scripts, supplementary CJK)
- Bytes 80–BF are continuation bytes and should never appear at the start of a character
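These rules amount to a simple range check per byte. A sketch of a classifier (the `classify` helper is hypothetical, written for this example) run over the hex dump from the top of this page:

```python
def classify(b: int) -> str:
    """Classify a single byte by its role in a UTF-8 stream."""
    if b <= 0x7F:
        return "ASCII (1-byte char)"
    if 0x80 <= b <= 0xBF:
        return "continuation byte"
    if 0xC2 <= b <= 0xDF:
        return "lead byte of 2-byte sequence"
    if 0xE0 <= b <= 0xEF:
        return "lead byte of 3-byte sequence"
    if 0xF0 <= b <= 0xF4:
        return "lead byte of 4-byte sequence"
    return "invalid in UTF-8"  # C0, C1, and F5-FF never occur in valid UTF-8

for b in bytes.fromhex("E4 B8 96 E7 95 8C"):
    print(f"{b:02X}: {classify(b)}")
```

Running this shows the dump is two 3-byte sequences: E4 and E7 are lead bytes, and the rest are continuations.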
Common debugging scenarios:
Mojibake (garbled text) typically occurs when UTF-8 bytes are interpreted as a different encoding (e.g., Latin-1 or Windows-1252). In a hex editor, the underlying bytes will look correct — the problem lies in the interpretation layer. If you see the sequence C3 A9 being displayed as "Ã©" instead of "é", the file is UTF-8 but being read as Latin-1.
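This failure mode can be reproduced in two lines, which is also a handy way to confirm a mojibake diagnosis:

```python
# Encode "é" as UTF-8, then (wrongly) decode those bytes as Latin-1.
raw = "é".encode("utf-8")       # b'\xc3\xa9'
garbled = raw.decode("latin-1")  # each byte becomes its own character
print(raw.hex(" ").upper(), "->", garbled)  # C3 A9 -> Ã©
```

The reverse trick (`garbled.encode("latin-1").decode("utf-8")`) often recovers the original text when the bytes survived intact.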
Use Case
Understanding UTF-8 hex encoding is essential when debugging character encoding issues in web applications, parsing internationalized text in binary protocols, or analyzing file contents that contain non-ASCII characters.