Unicode Code Points in Hexadecimal (U+xxxx)

Learn how Unicode code points are represented in hexadecimal notation (U+xxxx) and how they map to UTF-8, UTF-16, and UTF-32 byte sequences in hex editors.

Encoding

Hex (UTF-8 bytes)

U+0041 = 41, U+00E9 = C3 A9, U+1F600 = F0 9F 98 80

Characters

A, é (e with acute accent), 😀 (grinning face emoji)

Detailed Explanation

Unicode assigns every character a unique number called a code point, written as U+ followed by four to six hexadecimal digits (U+0041, U+1F600); values below U+1000 are padded with leading zeros to four digits. Code points run from U+0000 to U+10FFFF. Understanding how a code point relates to its encoded byte sequence is essential for working with text data in hex editors.
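As a minimal sketch of the notation, the following Python converts between a character and its U+XXXX form. The helper names `to_u_notation` and `from_u_notation` are hypothetical, chosen only for this illustration; `ord`, `chr`, and `int(..., 16)` are standard built-ins.

```python
def to_u_notation(ch: str) -> str:
    """Format one character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Parse a string such as 'U+1F600' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+XXXX notation: {text!r}")
    return chr(int(text[2:], 16))


# "\u00e9" is é, "\U0001F600" is the grinning face emoji.
for ch in ["A", "\u00e9", "\U0001F600"]:
    print(to_u_notation(ch))
```

The `04X` format specifier pads to four digits but does not truncate, so code points above U+FFFF come out with five or six digits.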

Unicode code point ranges:

Range              Name                    Example
U+0000 – U+007F    Basic Latin (ASCII)     U+0041 = A
U+0080 – U+00FF    Latin-1 Supplement      U+00E9 = é
U+0100 – U+024F    Latin Extended-A and B  U+0148 = ň
U+0370 – U+03FF    Greek and Coptic        U+03B1 = α
U+4E00 – U+9FFF    CJK Unified Ideographs  U+4E16 = 世
U+1F600 – U+1F64F  Emoticons               U+1F600 = 😀
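The ranges listed above can be turned into a small lookup, sketched here in Python. The `BLOCKS` list and `block_name` helper are hypothetical names for this example, and the list covers only the ranges shown in this section, not the full set of Unicode blocks.

```python
# (first code point, last code point, name), limited to the ranges listed above.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin (ASCII)"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    # Two adjacent blocks: Latin Extended-A (U+0100-U+017F) and -B (U+0180-U+024F).
    (0x0100, 0x024F, "Latin Extended"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]


def block_name(ch: str):
    """Return the name of the listed range containing ch, or None if not listed."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None
```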

Code point vs. encoding:

A code point is an abstract number. The actual bytes stored in a file depend on the encoding used:

UTF-8 encoding in hex:

  • U+0041 (A) → 41 (1 byte — same as ASCII)
  • U+00E9 (é) → C3 A9 (2 bytes)
  • U+4E16 (世) → E4 B8 96 (3 bytes)
  • U+1F600 (😀) → F0 9F 98 80 (4 bytes)
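To make the byte counts above less mysterious, here is a hand-written sketch of the UTF-8 bit layout in Python. The function name `utf8_encode` is hypothetical; in real code, `str.encode("utf-8")` does the same job and should be preferred.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by packing its bits manually."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ').upper()}")
```

The lead byte's high bits (0, 110, 1110, 11110) tell a reader how many bytes the sequence has, and every continuation byte starts with 10, which is why they always fall in the range 80 to BF.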

UTF-16 encoding in hex:

  • U+0041 (A) → 00 41 (2 bytes, big-endian) or 41 00 (little-endian)
  • U+00E9 (é) → 00 E9 or E9 00
  • U+4E16 (世) → 4E 16 or 16 4E
  • U+1F600 (😀) → D8 3D DE 00 (surrogate pair, big-endian, 4 bytes)
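The surrogate pair for U+1F600 can be derived by hand. The sketch below, with the hypothetical helper name `utf16_surrogates`, shows the arithmetic in Python: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def utf16_surrogates(cp: int):
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 | (offset >> 10)     # top 10 bits    -> D800..DBFF
    low = 0xDC00 | (offset & 0x3FF)    # bottom 10 bits -> DC00..DFFF
    return high, low


high, low = utf16_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
```

Written big-endian, each 16-bit unit becomes two bytes in order (D8 3D DE 00); little-endian swaps the bytes within each unit (3D D8 00 DE), not the order of the units.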

UTF-32 encoding in hex:

  • Every code point is exactly 4 bytes: its value written as a 32-bit integer
  • U+0041 → 00 00 00 41 (big-endian)
  • U+1F600 → 00 01 F6 00 (big-endian)
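Because UTF-32 is just the code point stored as a fixed-width integer, it can be reproduced with `int.to_bytes`. A short Python sketch, where `utf32_bytes` is a hypothetical helper name:

```python
def utf32_bytes(ch: str, byteorder: str = "big") -> bytes:
    """Return the 4-byte UTF-32 form of one character, without a BOM."""
    return ord(ch).to_bytes(4, byteorder)


print(utf32_bytes("A").hex(" ").upper())
print(utf32_bytes("\U0001F600").hex(" ").upper())
print(utf32_bytes("\U0001F600", "little").hex(" ").upper())
```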

Reading Unicode in hex editors:

When examining a text file in a hex editor, the bytes you see depend on the encoding. A file containing just "é" looks different in each encoding:

  • UTF-8: C3 A9 (2 bytes, no BOM)
  • UTF-8 with BOM: EF BB BF C3 A9 (5 bytes)
  • UTF-16BE: FE FF 00 E9 (4 bytes with BOM)
  • UTF-16LE: FF FE E9 00 (4 bytes with BOM)
  • UTF-32BE: 00 00 FE FF 00 00 00 E9 (8 bytes with BOM)
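The byte sequences above can be reproduced with Python's standard `codecs` module, which exposes the BOM constants. This is a sketch only: `hex_bytes` and `encoded_variants` are hypothetical helper names, and the BOM is prepended explicitly because the `utf-16-be`, `utf-16-le`, and `utf-32-be` codecs do not write one themselves.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Render bytes the way a hex editor shows them: 'EF BB BF ...'."""
    return data.hex(" ").upper()


def encoded_variants(text: str) -> dict:
    """Encode text in several Unicode encodings, with and without a BOM."""
    return {
        "UTF-8": hex_bytes(text.encode("utf-8")),
        "UTF-8 with BOM": hex_bytes(codecs.BOM_UTF8 + text.encode("utf-8")),
        "UTF-16BE with BOM": hex_bytes(codecs.BOM_UTF16_BE + text.encode("utf-16-be")),
        "UTF-16LE with BOM": hex_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le")),
        "UTF-32BE with BOM": hex_bytes(codecs.BOM_UTF32_BE + text.encode("utf-32-be")),
    }


# "\u00e9" is é (U+00E9).
for label, dump in encoded_variants("\u00e9").items():
    print(f"{label}: {dump}")
```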

Escape sequences in programming:

Different languages use different syntax for Unicode escape sequences:

  • JavaScript: \u0041 or \u{1F600}
  • Python: \u0041 or \U0001F600
  • HTML: &#x41; (hexadecimal) or &#65; (decimal)
  • CSS: \0041
  • JSON: \u0041 (four hex digits only, so code points above U+FFFF are written as a surrogate pair, e.g. \uD83D\uDE00 for U+1F600)
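A few of these escape forms can be checked from Python using only the standard library. In this sketch, `json_escape` is a hypothetical wrapper name; `json.dumps`, `json.loads`, and `html.unescape` are standard functions. Note that `json.dumps` escapes non-ASCII characters by default and emits lowercase hex digits.

```python
import html
import json


def json_escape(text: str) -> str:
    """Serialize text as a JSON string; non-ASCII becomes \\uXXXX escapes."""
    return json.dumps(text)  # ensure_ascii=True is the default


# Python's own escapes: \uXXXX takes 4 hex digits, \UXXXXXXXX takes 8.
assert "\u0041" == "A"
assert len("\U0001F600") == 1  # one code point, not two

print(json_escape("\u00e9"))       # BMP character: a single \u escape
print(json_escape("\U0001F600"))   # above U+FFFF: a surrogate pair

# HTML numeric character references, hexadecimal and decimal.
print(html.unescape("&#x41;"), html.unescape("&#65;"))
```

Decoding goes the other way: `json.loads` joins a surrogate pair back into a single code point.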

Practical debugging tip:

If a text file looks correct in one editor but garbled in another, open it in a hex editor. Check for a BOM at the start, then examine the byte sequences. If you see C2 or C3 followed by a byte in the range 80 to BF where you expect a single accented Latin letter, the file is UTF-8; an editor that displays that letter as two symbols (for example Ã© instead of é) is most likely decoding the file as Latin-1 or Windows-1252. If you see 00 bytes alternating with ASCII characters, the file is probably UTF-16.
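The checks described in this tip can be partly automated. The following Python is a rough sketch under stated assumptions: `sniff_bom` and `zero_byte_ratio` are hypothetical helper names, the BOM constants come from the standard `codecs` module, and the zero-byte ratio is only a heuristic, not a reliable detector.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first
# character is U+0000 would be misreported; that ambiguity is inherent.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def zero_byte_ratio(data: bytes) -> float:
    """Fraction of 00 bytes; near 0.5 for mostly-ASCII text stored as UTF-16."""
    return data.count(0) / len(data) if data else 0.0


# Reproducing the garbling: UTF-8 bytes for é decoded as Latin-1.
garbled = "\u00e9".encode("utf-8").decode("latin-1")
print(garbled)  # two characters, U+00C3 then U+00A9
```

For example, `sniff_bom(open(path, "rb").read(4))` inspects just the first four bytes of a file, where `path` is whatever file is under investigation.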

Use Case

Understanding Unicode hex encoding is essential when debugging character rendering issues across platforms, implementing text processing in internationalized applications, or analyzing binary protocol payloads that contain encoded text.
