UTF-32: Fixed-Width Encoding Explained
Learn why UTF-32 uses exactly 4 bytes per code point, when it is useful (internal processing), and why it is rarely used for storage or transmission.
Detailed Explanation
UTF-32: The Simple (But Wasteful) Encoding
UTF-32 is the simplest Unicode encoding: every code point is stored as a single 32-bit code unit, exactly 4 bytes. There are no variable-length sequences and no surrogate pairs.
Size Comparison
Bytes are shown in hexadecimal. The UTF-16 and UTF-32 rows use big-endian byte order with no BOM.
"A" (U+0041)
UTF-8: 1 byte [41]
UTF-16: 2 bytes [00 41]
UTF-32: 4 bytes [00 00 00 41]
"é" (U+00E9)
UTF-8: 2 bytes [C3 A9]
UTF-16: 2 bytes [00 E9]
UTF-32: 4 bytes [00 00 00 E9]
"東" (U+6771)
UTF-8: 3 bytes [E6 9D B1]
UTF-16: 2 bytes [67 71]
UTF-32: 4 bytes [00 00 67 71]
"😀" (U+1F600)
UTF-8: 4 bytes [F0 9F 98 80]
UTF-16: 4 bytes [D8 3D DE 00]
UTF-32: 4 bytes [00 01 F6 00]
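The byte counts in this comparison can be reproduced with Python's built-in codecs. This is a minimal sketch: the helper name encoded_sizes is made up for illustration, and the "-be" codec variants are chosen so that the output is big-endian with no BOM.

```python
# Sample characters: U+0041, U+00E9, U+6771, U+1F600 (written as escapes
# so the result does not depend on how this file itself is encoded).
samples = ["\u0041", "\u00e9", "\u6771", "\U0001F600"]

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single character."""
    codecs = ("utf-8", "utf-16-be", "utf-32-be")  # "-be": big-endian, no BOM
    return tuple(len(ch.encode(codec)) for codec in codecs)

for ch in samples:
    u8, u16, u32 = encoded_sizes(ch)
    raw = ch.encode("utf-32-be").hex(" ")  # bytes.hex(sep) needs Python 3.8+
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32} [{raw}]")
```

Only the UTF-32 column stays constant at 4 bytes. The UTF-8 and UTF-16 columns grow as the code point value increases.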
Advantages of UTF-32
- O(1) random access: string[n] directly gives the nth code point
- Simple length calculation: Byte length / 4 = code point count
- No partial character reads: Every 4-byte boundary is a character boundary
- Simplified algorithms: Code that iterates over or indexes by code point needs no decoding step
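The random-access and length properties can be sketched directly on a raw byte buffer. The function names code_point_count and code_point_at are hypothetical, and the sketch assumes BOM-less, big-endian UTF-32 input.

```python
import struct

def code_point_count(buf: bytes) -> int:
    """Number of code points in BOM-less UTF-32 data: byte length / 4."""
    if len(buf) % 4:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    return len(buf) // 4

def code_point_at(buf: bytes, n: int) -> int:
    """Return the nth code point (0-based) by reading 4 bytes at offset 4*n."""
    if not 0 <= n < code_point_count(buf):
        raise IndexError(n)
    (cp,) = struct.unpack_from(">I", buf, 4 * n)  # ">I" = big-endian uint32
    return cp

data = "A\u6771\U0001F600".encode("utf-32-be")
print(code_point_count(data))        # 3
print(hex(code_point_at(data, 2)))   # 0x1f600
```

Neither function scans the buffer, so both run in constant time regardless of the string's length. An equivalent lookup on UTF-8 or UTF-16 data would have to walk from the start, or consult an index, to find the nth code point.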
Disadvantages of UTF-32
- Massive space waste: ASCII text uses 4x the storage of UTF-8
- No grapheme awareness: UTF-32 still does not solve the grapheme vs code point problem
- Not web-compatible: The HTML standard does not support UTF-32, and the current JSON specification (RFC 8259) requires UTF-8 for JSON exchanged between systems
- Endianness issues: Needs a BOM or an explicitly specified byte order (UTF-32BE or UTF-32LE)
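Two of these disadvantages, grapheme awareness and endianness, are easy to observe with Python's standard codecs. The following is a small illustrative sketch and the variable names are arbitrary.

```python
import unicodedata

# Grapheme vs code point: NFD turns U+00E9 into "e" + U+0301 (combining
# acute accent). It displays as one character but is two code points,
# so it occupies 8 bytes in UTF-32.
decomposed = unicodedata.normalize("NFD", "\u00e9")
units = len(decomposed.encode("utf-32-be")) // 4
print(units)  # 2

# Endianness: the same code point has two possible byte sequences.
big = "A".encode("utf-32-be")      # 00 00 00 41
little = "A".encode("utf-32-le")   # 41 00 00 00

# The plain "utf-32" codec writes a 4-byte BOM first, in the machine's
# native byte order, so that a decoder can detect which order follows.
with_bom = "A".encode("utf-32")
print(len(with_bom))               # 8: BOM + one code point
print(with_bom.decode("utf-32"))   # A
```

Fixed-width code points therefore do not give fixed-width user-perceived characters. Segmenting text into graphemes still requires the Unicode text segmentation rules, whichever encoding is used.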
Where UTF-32 Is Used
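- Internal processing in Python: Since version 3.3 (PEP 393), CPython stores each string with 1, 2, or 4 bytes per code point, depending on the widest character it contains. Only strings with code points above U+FFFF use the full 4-byte (UTF-32-like) form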
- ICU library: The International Components for Unicode stores strings as UTF-16 but passes individual code points as 32-bit values (UChar32) in its character property APIs
- Text rendering engines: Some font rendering pipelines work in UTF-32 internally
- Academic/research: Simplifies Unicode algorithm implementation
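As a rough way to observe how CPython chooses a per-string width, the sketch below compares the memory footprint of strings of different lengths. It relies on sys.getsizeof, whose results are an implementation detail of CPython 3.3+ and may differ on other interpreters. The helper name bytes_per_char is made up for illustration.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by comparing n and 2n copies of ch.

    Subtracting the two sizes cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

for ch in ("A", "\u6771", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per code point")
```

On CPython, the reported width should grow with the largest code point in the string, reaching 4 bytes only when a character outside the Basic Multilingual Plane is present.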
Why Not for Storage?
For a 1 MB English text file:
- UTF-8: ~1 MB
- UTF-16: ~2 MB
- UTF-32: ~4 MB
For ASCII-heavy text, this 4x storage overhead makes UTF-32 impractical for storage, transmission, or any external format. It is best treated as an internal processing convenience.
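The ratio can be checked on any ASCII-only sample. The sketch below uses an arbitrary pangram repeated 1000 times as stand-in English text, and the "-le" codecs so that no BOM is counted.

```python
# ASCII-only sample text (an assumption standing in for "English text").
text = "The quick brown fox jumps over the lazy dog.\n" * 1000

codecs = ("utf-8", "utf-16-le", "utf-32-le")  # "-le": little-endian, no BOM
sizes = {codec: len(text.encode(codec)) for codec in codecs}

base = sizes["utf-8"]
for codec, size in sizes.items():
    print(f"{codec:10} {size:>8} bytes ({size / base:.0f}x)")
```

The gap narrows for other scripts. Text made up of characters such as "東" takes 3 bytes per character in UTF-8 and 4 in UTF-32, so the overhead there is about 1.33x rather than 4x.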
Use Case
When implementing text processing algorithms that need simple O(1) indexing by code point, or when working with Unicode normalization and collation algorithms internally, UTF-32 simplifies the code at the cost of memory. It should never be used for storage or data exchange.