UTF-8 Multibyte Hex Encoding
Understand how UTF-8 encodes multibyte characters as hex sequences. Learn the byte patterns for 2-byte, 3-byte, and 4-byte encodings with examples.
Hex
E4 B8 96 E7 95 8C
Text
世界
Detailed Explanation
UTF-8 is a variable-length character encoding that represents Unicode code points using one to four bytes. While standard ASCII characters (U+0000 to U+007F) require only a single byte, characters from other scripts — Chinese, Japanese, Arabic, emoji — require multiple bytes. Understanding the hex representation of these multibyte sequences is critical for debugging encoding issues, parsing binary data, and working with internationalized text.
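The variable-length behavior is easy to observe directly. A quick sketch in Python (any language with explicit byte access works; `bytes.hex` with a separator requires Python 3.8+):

```python
# One character from each byte-length class, encoded to UTF-8 hex.
for ch in ["A", "é", "世", "😀"]:
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {data.hex(' ').upper()} ({len(data)} byte(s))")
# A is 1 byte, é is 2, 世 is 3, 😀 is 4.
```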
UTF-8 byte patterns:
| Code Point Range | Bytes | Binary Pattern | Hex Range |
|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | 00 – 7F |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | C2 80 – DF BF |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0 80 – EF BF BF |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90 80 80 – F4 8F BF BF |
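These boundary values can be spot-checked by encoding the first and last code point of each range (a minimal sketch; Python's encoder follows the same table):

```python
# Encode the boundary code points of each UTF-8 length class.
for cp in [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]:
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {encoded.hex(' ').upper()}")
# U+00007F -> 7F, U+000080 -> C2 80, U+0007FF -> DF BF,
# U+000800 -> E0 A0 80, ..., U+10FFFF -> F4 8F BF BF
```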
Encoding example — "世" (U+4E16):
The code point U+4E16 falls in the 3-byte range (U+0800 – U+FFFF).
- Convert 4E16 to binary: 0100 1110 0001 0110
- Split into the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx:
  - Top 4 bits 0100 → first byte 1110 0100 = E4
  - Next 6 bits 111000 → second byte 10 111000 = B8
  - Last 6 bits 010110 → third byte 10 010110 = 96
- Result: E4 B8 96
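The bit manipulation above translates directly into code. A sketch of the 3-byte case (the function name `encode_3byte` is illustrative, not a standard API):

```python
def encode_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800–U+FFFF using the 3-byte template
    1110xxxx 10xxxxxx 10xxxxxx."""
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0xE0 | (cp >> 12)          # 1110 + top 4 bits
    b2 = 0x80 | ((cp >> 6) & 0x3F)  # 10 + middle 6 bits
    b3 = 0x80 | (cp & 0x3F)         # 10 + low 6 bits
    return bytes([b1, b2, b3])

print(encode_3byte(0x4E16).hex(" ").upper())  # E4 B8 96
```

The same pattern generalizes: the 2-byte case shifts by 6 once, and the 4-byte case shifts by 18, 12, and 6.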
Recognizing UTF-8 in hex dumps:
When viewing binary data in a hex editor, you can identify UTF-8 multibyte sequences by their leading byte patterns:
- Bytes C2–DF begin a 2-byte sequence (Latin extended, Greek, Cyrillic, etc.)
- Bytes E0–EF begin a 3-byte sequence (CJK characters, most symbols)
- Bytes F0–F4 begin a 4-byte sequence (emoji, rare scripts, supplementary CJK)
- Bytes 80–BF are continuation bytes and should never appear at the start of a character
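These rules amount to a simple range check per byte. A sketch of a classifier (the `classify` helper is hypothetical, written for this example) run over the hex dump from the top of this page:

```python
def classify(b: int) -> str:
    """Classify a single byte by its role in a UTF-8 stream."""
    if b <= 0x7F:
        return "ASCII (1-byte char)"
    if 0x80 <= b <= 0xBF:
        return "continuation byte"
    if 0xC2 <= b <= 0xDF:
        return "lead byte of 2-byte sequence"
    if 0xE0 <= b <= 0xEF:
        return "lead byte of 3-byte sequence"
    if 0xF0 <= b <= 0xF4:
        return "lead byte of 4-byte sequence"
    return "invalid in UTF-8"  # C0, C1, and F5-FF never occur in valid UTF-8

for b in bytes.fromhex("E4 B8 96 E7 95 8C"):
    print(f"{b:02X}: {classify(b)}")
```

Running this shows the dump is two 3-byte sequences: E4 and E7 are lead bytes, and the rest are continuations.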
Common debugging scenarios:
Mojibake (garbled text) typically occurs when UTF-8 bytes are interpreted as a different encoding (e.g., Latin-1 or Windows-1252). In a hex editor, the underlying bytes will look correct — the problem lies in the interpretation layer. If you see the sequence C3 A9 being displayed as "Ã©" instead of "é", the file is UTF-8 but being read as Latin-1.
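This failure mode can be reproduced in two lines, which is also a handy way to confirm a mojibake diagnosis:

```python
# Encode "é" as UTF-8, then (wrongly) decode those bytes as Latin-1.
raw = "é".encode("utf-8")       # b'\xc3\xa9'
garbled = raw.decode("latin-1")  # each byte becomes its own character
print(raw.hex(" ").upper(), "->", garbled)  # C3 A9 -> Ã©
```

The reverse trick (`garbled.encode("latin-1").decode("utf-8")`) often recovers the original text when the bytes survived intact.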
Use Case
Understanding UTF-8 hex encoding is essential when debugging character encoding issues in web applications, parsing internationalized text in binary protocols, or analyzing file contents that contain non-ASCII characters.