UTF-32: Fixed-Width Encoding Explained

Learn why UTF-32 uses exactly 4 bytes per code point, when it is useful (internal processing), and why it is rarely used for storage or transmission.

Detailed Explanation

UTF-32: The Simple (But Wasteful) Encoding

UTF-32 is the simplest Unicode encoding: every code point is stored in exactly 4 bytes (32 bits). No variable-length encoding, no surrogate pairs, no multi-byte sequences.

Size Comparison

Bytes are shown in hexadecimal; the UTF-16 and UTF-32 rows use big-endian byte order.

"A" (U+0041)
  UTF-8:   1 byte   [41]
  UTF-16:  2 bytes  [00 41]
  UTF-32:  4 bytes  [00 00 00 41]

"é" (U+00E9)
  UTF-8:   2 bytes  [C3 A9]
  UTF-16:  2 bytes  [00 E9]
  UTF-32:  4 bytes  [00 00 00 E9]

"東" (U+6771)
  UTF-8:   3 bytes  [E6 9D B1]
  UTF-16:  2 bytes  [67 71]
  UTF-32:  4 bytes  [00 00 67 71]

"😀" (U+1F600)
  UTF-8:   4 bytes  [F0 9F 98 80]
  UTF-16:  4 bytes  [D8 3D DE 00]
  UTF-32:  4 bytes  [00 01 F6 00]
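
The byte sequences above can be reproduced with Python's built-in codecs. This is a minimal sketch: the helper name `encoded_bytes` is made up for illustration, and the `-be` codec variants are chosen so that the output is big-endian with no byte-order mark.

```python
# Encode sample characters and print their byte sequences in hex.
# The "-be" codecs select big-endian byte order and emit no BOM.
SAMPLES = ["A", "\u00e9", "\u6771", "\U0001F600"]  # A, é, 東, 😀
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]


def encoded_bytes(ch):
    """Return a mapping of encoding name to the encoded bytes of ch."""
    return {enc: ch.encode(enc) for enc in ENCODINGS}


for ch in SAMPLES:
    print(f'"{ch}" (U+{ord(ch):04X})')
    for enc, raw in encoded_bytes(ch).items():
        hex_bytes = raw.hex(" ").upper()  # bytes.hex(sep) needs Python 3.8+
        print(f"  {enc:<10} {len(raw)} bytes  [{hex_bytes}]")
```

Only the UTF-32 column stays at 4 bytes for every sample; the other two grow with the code point value.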

Advantages of UTF-32

O(1) random access: string[n] directly gives the nth code point
Simple length calculation: Byte length / 4 = code point count (after excluding any BOM)
No partial code point reads: Every 4-byte boundary is a code point boundary
Simplified algorithms: Code that indexes or iterates by code point needs no logic for multi-unit sequences
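
The fixed-width properties listed above can be sketched directly on a raw byte buffer. The functions `code_point_count` and `code_point_at` are hypothetical helpers, and the sketch assumes a big-endian UTF-32 buffer with no BOM.

```python
import struct


def code_point_count(buf):
    """Number of code points in a BOM-less UTF-32 buffer."""
    if len(buf) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    return len(buf) // 4


def code_point_at(buf, n):
    """Read the nth code point with a single fixed-offset lookup."""
    # ">I" = big-endian unsigned 32-bit integer, read at byte offset 4 * n.
    (cp,) = struct.unpack_from(">I", buf, 4 * n)
    return cp


text = "A\u00e9\u6771\U0001F600"  # A, é, 東, 😀
buf = text.encode("utf-32-be")

print(code_point_count(buf))
print(f"U+{code_point_at(buf, 3):04X}")
```

No scanning from the start of the buffer is needed: the offset of code point n is always 4 * n, which is what makes the lookup O(1).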

Disadvantages of UTF-32

Massive space waste: ASCII text uses 4x the storage of UTF-8
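No grapheme awareness: UTF-32 does not solve the grapheme vs code point problem; a user-perceived character such as "é" written as e + U+0301 (combining acute accent) still spans two code points
Not web-compatible: The WHATWG Encoding Standard that browsers implement does not include UTF-32, and RFC 8259 requires UTF-8 for JSON exchanged between systems
Endianness issues: Needs a BOM or an explicit byte-order label (UTF-32BE or UTF-32LE)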
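
Two of the disadvantages above, byte order and graphemes, are easy to observe with Python's codecs. This is an illustrative sketch; `utf32_variants`, `decodes_as`, and `count_code_points` are hypothetical helper names, not part of any library.

```python
import codecs
import unicodedata


def utf32_variants(text):
    """Encode text with explicit and BOM-prefixed UTF-32 codecs."""
    return {
        "utf-32-be": text.encode("utf-32-be"),  # big-endian, no BOM
        "utf-32-le": text.encode("utf-32-le"),  # little-endian, no BOM
        "utf-32": text.encode("utf-32"),        # BOM + platform byte order
    }


def decodes_as(buf, encoding):
    """Return the decoded string, or None if the bytes are invalid."""
    try:
        return buf.decode(encoding)
    except UnicodeDecodeError:
        return None


def count_code_points(text):
    """Count code points via the UTF-32 byte length."""
    return len(text.encode("utf-32-be")) // 4


variants = utf32_variants("A")
for name, raw in variants.items():
    print(f"{name:<10} [{raw.hex(' ').upper()}]")

# Reading big-endian bytes as little-endian yields 0x41000000,
# which is above U+10FFFF and therefore rejected.
print(decodes_as(variants["utf-32-be"], "utf-32-le"))

# One user-perceived character, two code points in decomposed form.
decomposed = "e\u0301"
print(count_code_points(decomposed))
print(count_code_points(unicodedata.normalize("NFC", decomposed)))
```

The plain `utf-32` codec prepends a BOM so a decoder can detect the byte order; the `-be` and `-le` variants rely on both sides agreeing in advance.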

Where UTF-32 Is Used

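Internal processing in Python: CPython 3.3+ (PEP 393) stores each string with 1, 2, or 4 bytes per code point, depending on the highest code point it contains; only strings with code points above U+FFFF use the full 4-byte, UTF-32-like form
ICU library: The International Components for Unicode stores strings as UTF-16 but represents individual code points as 32-bit values (UChar32) in its character property and iteration APIs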
Text rendering engines: Some font rendering pipelines work in UTF-32 internally
Academic/research: Simplifies Unicode algorithm implementation
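
CPython's variable-width storage can be observed indirectly by measuring how much a string object grows when one more code point is appended. This is a rough sketch that relies on CPython implementation details (`sys.getsizeof` and the compact string layout); other Python implementations may report different numbers. The helper name `bytes_per_code_point` is hypothetical.

```python
import sys


def bytes_per_code_point(sample):
    """Growth in object size, in bytes, for one extra copy of sample.

    sample is assumed to be a single code point. CPython-specific.
    """
    return sys.getsizeof(sample * 101) - sys.getsizeof(sample * 100)


for label, ch in [("ASCII", "a"), ("BMP", "\u6771"), ("astral", "\U0001F600")]:
    print(f"{label:<7} U+{ord(ch):04X}: {bytes_per_code_point(ch)} byte(s) per code point")
```

On CPython the three cases are expected to show 1, 2, and 4 bytes respectively, so the 4-byte form is only paid for when a string actually contains a code point above U+FFFF.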

Why Not for Storage?

For a 1 MB English text file:

UTF-8: ~1 MB
UTF-16: ~2 MB
UTF-32: ~4 MB

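For ASCII-heavy text, the 4x storage overhead of UTF-32 relative to UTF-8 makes it impractical for storage, transmission, or any external format. The gap narrows for text outside ASCII, but UTF-32 is never smaller than UTF-8 or UTF-16. It exists mainly as an internal processing convenience.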
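
The storage comparison can be checked by encoding sample text and comparing byte counts. This is a minimal sketch with made-up sample strings; `encoded_sizes` is a hypothetical helper, and the `-le` codecs are used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Byte length of text in each encoding (no BOM)."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


english = "The quick brown fox jumps over the lazy dog. " * 1000
japanese = "\u6771\u4eac" * 1000  # 東京 repeated

for label, text in [("English", english), ("Japanese", japanese)]:
    sizes = encoded_sizes(text)
    base = sizes["utf-8"]
    for enc, size in sizes.items():
        print(f"{label:<9} {enc:<10} {size:>7} bytes  ({size / base:.2f}x UTF-8)")
```

For the ASCII sample UTF-32 is exactly 4 times the size of UTF-8. For the Japanese sample, where UTF-8 needs 3 bytes per code point, the ratio drops to 4/3, but UTF-32 is still the largest of the three.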

Use Case

When implementing text processing algorithms that need simple O(1) indexing by code point, or when working with Unicode normalization and collation algorithms internally, UTF-32 simplifies the code at the cost of memory. It is a poor fit for storage or data exchange, where UTF-8 is the usual choice.
