Unicode Code Points in Hexadecimal (U+xxxx)

Q: What are Unicode Code Points in Hexadecimal (U+xxxx)?


Learn how Unicode code points are represented in hexadecimal notation (U+xxxx) and how they map to UTF-8, UTF-16, and UTF-32 byte sequences in hex editors.

Encoding

Hex

U+0041 = 41, U+00E9 = C3 A9, U+1F600 = F0 9F 98 80 (UTF-8 bytes)

ASCII

A, e with accent (é), grinning face emoji

Detailed Explanation

Unicode assigns every character a unique number called a code point, written in the notation U+XXXX, where XXXX is the hexadecimal value padded to at least four digits (code points above U+FFFF use five or six digits, up to U+10FFFF). Understanding how code points relate to their encoded byte sequences is essential for working with text data in hex editors.

Unicode code point ranges:

Range	Name	Example
U+0000 – U+007F	Basic Latin (ASCII)	U+0041 = A
U+0080 – U+00FF	Latin-1 Supplement	U+00E9 = é
U+0100 – U+024F	Latin Extended-A and Extended-B	U+0148 = ň
U+0370 – U+03FF	Greek and Coptic	U+03B1 = α
U+4E00 – U+9FFF	CJK Unified Ideographs	U+4E16 = 世
U+1F600 – U+1F64F	Emoticons	U+1F600 = 😀
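As a minimal sketch of the notation itself, the Python snippet below converts between characters and their U+XXXX form. The helper names `to_u_notation` and `from_u_notation` are made up for this illustration; only the built-ins `ord`, `chr`, and `int` are standard.

```python
def to_u_notation(ch):
    """Format one character as U+XXXX (hex, padded to at least four digits)."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text):
    """Turn a string such as 'U+00E9' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+XXXX form: {text!r}")
    return chr(int(text[2:], 16))


# The example characters from the table above, written as escapes
samples = "A\u00e9\u0148\u03b1\u4e16\U0001F600"
for ch in samples:
    print(to_u_notation(ch), ch)
```

Padding with `04X` keeps BMP code points at four digits while letting larger values such as U+1F600 grow to five or six.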

Code point vs. encoding:

A code point is an abstract number. The actual bytes stored in a file depend on the encoding used:

UTF-8 encoding in hex:

U+0041 (A) → 41 (1 byte — same as ASCII)
U+00E9 (é) → C3 A9 (2 bytes)
U+4E16 (世) → E4 B8 96 (3 bytes)
U+1F600 (😀) → F0 9F 98 80 (4 bytes)
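The byte counts above follow from how UTF-8 splits a code point's bits across lead and continuation bytes. The hand-written encoder below is an illustrative sketch (the function name `utf8_encode` is hypothetical); in practice Python's built-in `str.encode("utf-8")` does the same job, and the loop checks the sketch against it.

```python
def utf8_encode(cp):
    """Encode one code point to UTF-8 bytes by explicit bit manipulation."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:            # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    encoded = utf8_encode(cp)
    # Cross-check against the standard library encoder
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ').upper()}")
```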

UTF-16 encoding in hex:

U+0041 (A) → 00 41 (2 bytes, big-endian) or 41 00 (little-endian)
U+00E9 (é) → 00 E9 or E9 00
U+4E16 (世) → 4E 16 or 16 4E
U+1F600 (😀) → D8 3D DE 00 (surrogate pair, big-endian, 4 bytes)
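The surrogate pair for U+1F600 comes from subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. A small sketch of that arithmetic follows; `utf16_units` and `utf16_bytes` are names chosen for this example, not library functions.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 | (v >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


def utf16_bytes(cp, byteorder="big"):
    """Serialize the code units in the requested byte order, without a BOM."""
    return b"".join(u.to_bytes(2, byteorder) for u in utf16_units(cp))


for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    be = utf16_bytes(cp, "big").hex(" ").upper()
    le = utf16_bytes(cp, "little").hex(" ").upper()
    print(f"U+{cp:04X}  BE: {be}  LE: {le}")
```

Note that in little-endian order each 16-bit unit is swapped individually, so the pair for U+1F600 becomes 3D D8 00 DE rather than a full reversal of the four bytes.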

UTF-32 encoding in hex:

Every code point is exactly 4 bytes
U+0041 → 00 00 00 41 (big-endian)
U+1F600 → 00 01 F6 00 (big-endian)
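Because UTF-32 is fixed-width, encoding is just writing the code point as a 4-byte integer, and decoding is reading the buffer four bytes at a time. The sketch below assumes big-endian data with no BOM; the function names are illustrative.

```python
import struct


def utf32be_encode(cp):
    """Write one code point as a 4-byte big-endian integer."""
    return cp.to_bytes(4, "big")


def utf32be_decode(data):
    """Read a BOM-less UTF-32BE buffer back into a list of code points."""
    if len(data) % 4:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    count = len(data) // 4
    return list(struct.unpack(f">{count}I", data))


raw = utf32be_encode(0x41) + utf32be_encode(0x1F600)
print(raw.hex(" ").upper())
print([f"U+{cp:04X}" for cp in utf32be_decode(raw)])
```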

Reading Unicode in hex editors:

When examining a text file in a hex editor, the bytes you see depend on the encoding. A file containing just "é" looks different in each encoding:

UTF-8: C3 A9 (2 bytes, no BOM)
UTF-8 with BOM: EF BB BF C3 A9 (5 bytes)
UTF-16BE: FE FF 00 E9 (4 bytes with BOM)
UTF-16LE: FF FE E9 00 (4 bytes with BOM)
UTF-32BE: 00 00 FE FF 00 00 00 E9 (8 bytes with BOM)
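The BOM patterns above can be checked programmatically. This is a minimal sketch using the BOM constants from Python's `codecs` module; the function name `detect_bom` and its return shape are assumptions made for the example. The UTF-32 checks come first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).

```python
import codecs

# Longest BOMs first, so UTF-32LE is not mistaken for UTF-16LE
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


for hex_bytes in ("C3 A9", "EF BB BF C3 A9", "FE FF 00 E9",
                  "FF FE E9 00", "00 00 FE FF 00 00 00 E9"):
    print(hex_bytes, "->", detect_bom(bytes.fromhex(hex_bytes)))
```

A missing BOM does not identify the encoding by itself; plain UTF-8 files usually have none, so a `(None, 0)` result only means the byte sequences need a closer look.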

Escape sequences in programming:

Different languages use different syntax for Unicode escape sequences:

JavaScript: \u0041 or \u{1F600}
Python: \u0041 or \U0001F600
HTML: &#x0041; (hexadecimal) or &#65; (decimal)
CSS: \0041
JSON: \u0041 (BMP only; code points above U+FFFF need a surrogate pair, e.g. \uD83D\uDE00 for U+1F600)
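A few of these escape forms can be verified from Python using only the standard library. The sketch below checks Python's own escapes, the JSON surrogate-pair form via `json`, and HTML numeric character references via `html.unescape`; the variable names are illustrative.

```python
import html
import json

# Python string escapes: \uXXXX for the BMP, \UXXXXXXXX for any code point
assert "\u0041" == "A"
grinning = "\U0001F600"
assert ord(grinning) == 0x1F600

# JSON has only \uXXXX, so a non-BMP character is written as a surrogate pair
as_json = json.dumps(grinning)            # ensure_ascii=True by default
print(as_json)
assert json.loads(r'"\ud83d\ude00"') == grinning

# HTML numeric character references, hexadecimal and decimal
assert html.unescape("&#x41;") == "A"
assert html.unescape("&#65;") == "A"
assert html.unescape("&#x1F600;") == grinning
```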

Practical debugging tip:

If a text file looks correct in one editor but garbled in another, open it in a hex editor. Check for a BOM at the start, then examine the byte sequences. If you see a C2 or C3 byte followed by a byte in the 80 to BF range where you expect a single accented Latin letter (for example C3 A9 for é, which a Latin-1 viewer shows as "Ã©"), the file is likely UTF-8 being misinterpreted as a single-byte encoding. If you see 00 bytes between ASCII characters, the file is probably UTF-16.
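Both symptoms can be reproduced in a few lines of Python. The first part shows how UTF-8 bytes decoded with the wrong single-byte encoding produce garbled text; the second is a rough heuristic (the name `guess_utf16` and its thresholds are arbitrary choices for this sketch) that looks at where 00 bytes fall in mostly-ASCII data.

```python
# UTF-8 bytes for "é" read as Latin-1 turn into two characters
utf8_bytes = "\u00e9".encode("utf-8")            # C3 A9
garbled = utf8_bytes.decode("latin-1")           # U+00C3 U+00A9
print(utf8_bytes.hex(" ").upper(), "->", garbled)


def guess_utf16(data):
    """Guess 'UTF-16LE' or 'UTF-16BE' from the position of 00 bytes, else None.

    Only meaningful for mostly-ASCII text; thresholds are illustrative.
    """
    if len(data) < 2:
        return None
    even, odd = data[0::2], data[1::2]
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    if odd_zero > 0.5 and even_zero < 0.1:
        return "UTF-16LE"    # 48 00 65 00 ...
    if even_zero > 0.5 and odd_zero < 0.1:
        return "UTF-16BE"    # 00 48 00 65 ...
    return None


for enc in ("utf-8", "utf-16-le", "utf-16-be"):
    print(enc, "->", guess_utf16("Hello".encode(enc)))
```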

Use Case

Understanding Unicode hex encoding is essential when debugging character rendering issues across platforms, implementing text processing in internationalized applications, or analyzing binary protocol payloads that contain encoded text.
