# Precomposed vs Decomposed Unicode Characters
Understand the difference between precomposed (single code point) and decomposed (base + combining mark) Unicode characters, and how normalization converts between them.
## Detailed Explanation

### Precomposed vs Decomposed Characters
Unicode provides two ways to represent many accented characters: as a single precomposed code point, or as a decomposed sequence of base character plus combining marks.
### Examples of Both Representations
| Character | Precomposed | Decomposed |
|---|---|---|
| é | U+00E9 (1 code point) | U+0065 + U+0301 (2 code points) |
| ö | U+00F6 (1 code point) | U+006F + U+0308 (2 code points) |
| ç | U+00E7 (1 code point) | U+0063 + U+0327 (2 code points) |
| Å | U+00C5 (1 code point) | U+0041 + U+030A (2 code points) |
### Where Each Form Comes From
Precomposed characters typically come from:
- Direct keyboard input on Windows and Linux
- Copy-paste from legacy encodings (ISO-8859-1, Windows-1252)
- NFC normalization
Decomposed characters typically come from:
- The macOS file system (HFS+ stores filenames in NFD; APFS-based APIs often return NFD as well)
- Some input methods on mobile devices
- Text generated by certain programming libraries
- NFD normalization
### The String Length Problem
```javascript
"é".length // 1 (precomposed: U+00E9)
"é".length // 2 (decomposed: U+0065 + U+0301)
"é" === "é" // false! (different code point sequences)
"é".normalize("NFC") === "é".normalize("NFC") // true
```
This is a common source of bugs: string.length gives different results depending on which form is used, even though both represent the same visible character.
### Which to Choose?
NFC (precomposed) is generally preferred because:
- Shorter byte representation
- More compatible with legacy systems
- W3C recommendation for web content
- More predictable `string.length` behavior
## Use Case
Directly relevant to anyone building text processing, search, or file handling systems. macOS developers frequently encounter this issue because the file system returns NFD-normalized filenames, while files created on Windows use NFC. Cross-platform applications must handle both forms.