UTF-8 Multibyte Hex Encoding

Understand how UTF-8 encodes multibyte characters as hex sequences. Learn the byte patterns for 2-byte, 3-byte, and 4-byte encodings with examples.

Example

UTF-8 hex: E4 B8 96 E7 95 8C
Decoded text: 世界
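The round trip above can be reproduced in a few lines of Python (a minimal sketch using the standard `str.encode` / `bytes.decode` methods):

```python
# Text → UTF-8 bytes → hex string, and back again.
text = "世界"
data = text.encode("utf-8")
print(data.hex(" ").upper())                     # E4 B8 96 E7 95 8C
print(bytes.fromhex("E4B896E7958C").decode("utf-8"))  # 世界
```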

Detailed Explanation

UTF-8 is a variable-length character encoding that represents Unicode code points using one to four bytes. While standard ASCII characters (U+0000 to U+007F) require only a single byte, characters from other scripts — Chinese, Japanese, Arabic, emoji — require multiple bytes. Understanding the hex representation of these multibyte sequences is critical for debugging encoding issues, parsing binary data, and working with internationalized text.

UTF-8 byte patterns:

Code Point Range     Bytes  Binary Pattern                       Hex Range
U+0000 – U+007F      1      0xxxxxxx                             00 – 7F
U+0080 – U+07FF      2      110xxxxx 10xxxxxx                    C2 80 – DF BF
U+0800 – U+FFFF      3      1110xxxx 10xxxxxx 10xxxxxx           E0 A0 80 – EF BF BF
U+10000 – U+10FFFF   4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  F0 90 80 80 – F4 8F BF BF

Encoding example — "世" (U+4E16):

The code point U+4E16 falls in the 3-byte range (U+0800 – U+FFFF).

  1. Convert 4E16 to binary: 0100 1110 0001 0110
  2. Split into the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx:
    • Take the top 4 bits: 0100 → first byte: 1110 0100 = E4
    • Next 6 bits: 111000 → second byte: 10 111000 = B8
    • Last 6 bits: 010110 → third byte: 10 010110 = 96
  3. Result: E4 B8 96
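The three steps above can be expressed directly as bit operations. The helper below is a sketch for the 3-byte case only (the function name `encode_3byte` is hypothetical, not a standard API):

```python
def encode_3byte(cp: int) -> bytes:
    """Pack a code point in U+0800–U+FFFF into the 3-byte UTF-8 template."""
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0b1110_0000 | (cp >> 12)           # 1110xxxx: top 4 bits
    b2 = 0b1000_0000 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    b3 = 0b1000_0000 | (cp & 0x3F)          # 10xxxxxx: bottom 6 bits
    return bytes([b1, b2, b3])

print(encode_3byte(0x4E16).hex(" ").upper())  # E4 B8 96
# Sanity check against Python's built-in encoder:
assert encode_3byte(0x4E16) == "世".encode("utf-8")
```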

Recognizing UTF-8 in hex dumps:

When viewing binary data in a hex editor, you can identify UTF-8 multibyte sequences by their leading byte patterns:

  • Bytes C2 – DF begin a 2-byte sequence (Latin extended, Greek, Cyrillic, etc.)
  • Bytes E0 – EF begin a 3-byte sequence (CJK characters, most symbols)
  • Bytes F0 – F4 begin a 4-byte sequence (emoji, rare scripts, supplementary CJK)
  • Bytes 80 – BF are continuation bytes and should never appear at the start of a character
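These leading-byte rules can be captured in a small classifier (a sketch; the function name `classify` and the labels are illustrative):

```python
def classify(b: int) -> str:
    """Classify a single byte by its UTF-8 leading-bit pattern."""
    if b <= 0x7F:
        return "ASCII (1-byte)"
    if 0x80 <= b <= 0xBF:
        return "continuation"
    if 0xC2 <= b <= 0xDF:
        return "2-byte lead"
    if 0xE0 <= b <= 0xEF:
        return "3-byte lead"
    if 0xF0 <= b <= 0xF4:
        return "4-byte lead"
    return "invalid in UTF-8"  # C0, C1, and F5–FF never occur in valid UTF-8

for b in b"\x41\xe4\xb8\x96":
    print(f"{b:02X}: {classify(b)}")
```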

Common debugging scenarios:

Mojibake (garbled text) typically occurs when UTF-8 bytes are interpreted as a different encoding (e.g., Latin-1 or Windows-1252). In a hex editor, the underlying bytes will look correct — the problem lies in the interpretation layer. If you see the sequence C3 A9 displayed as "Ã©" instead of "é", the file is UTF-8 but is being read as Latin-1.
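This mismatch is easy to reproduce: encode a character as UTF-8, then decode the same bytes as Latin-1 (a minimal demonstration using Python's built-in codecs):

```python
# Simulate mojibake: UTF-8 bytes for "é" read back as Latin-1.
utf8_bytes = "é".encode("utf-8")     # C3 A9
print(utf8_bytes.hex(" ").upper())   # C3 A9
print(utf8_bytes.decode("latin-1"))  # Ã© — the classic mojibake pattern
print(utf8_bytes.decode("utf-8"))    # é — correct interpretation
```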

Use Case

Understanding UTF-8 hex encoding is essential when debugging character encoding issues in web applications, parsing internationalized text in binary protocols, or analyzing file contents that contain non-ASCII characters.
