Unicode Combining Characters and Normalization
Learn how Unicode combining characters work, how they interact with normalization forms, and why they matter for text processing and display.
Detailed Explanation
Combining Characters in Unicode
Combining characters are Unicode characters that modify the preceding base character, typically by adding a diacritic. They are not meant to stand alone: each one attaches to ("combines" with) the character before it.
Common Combining Characters
| Code Point | Name | Example (base + mark → result) |
|---|---|---|
| U+0300 | COMBINING GRAVE ACCENT | a + U+0300 → à |
| U+0301 | COMBINING ACUTE ACCENT | e + U+0301 → é |
| U+0302 | COMBINING CIRCUMFLEX ACCENT | o + U+0302 → ô |
| U+0303 | COMBINING TILDE | n + U+0303 → ñ |
| U+0308 | COMBINING DIAERESIS | u + U+0308 → ü |
| U+0327 | COMBINING CEDILLA | c + U+0327 → ç |
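As a quick check of the table above, the following minimal sketch uses Python's standard `unicodedata` module to look up each mark's name and Canonical Combining Class. The `MARKS` list and the `describe` helper are illustrative names, not part of any library.

```python
import unicodedata

# Code points taken from the table above.
MARKS = ["\u0300", "\u0301", "\u0302", "\u0303", "\u0308", "\u0327"]

def describe(mark: str) -> tuple[str, str, int]:
    """Return (code point label, Unicode name, canonical combining class)."""
    return (
        f"U+{ord(mark):04X}",
        unicodedata.name(mark),
        unicodedata.combining(mark),  # 0 would mean "not reordered"
    )

for mark in MARKS:
    print(describe(mark))
```

The marks placed above the base letter report class 230, while the cedilla, which attaches below, reports class 202.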
Multiple Combining Characters
A single base character can have multiple combining marks:
a + U+0308 (diaeresis) + U+0301 (acute) = ä́
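The stacked sequence can be inspected in Python as a rough illustration. The `combining_marks` helper below is a hypothetical name; it treats any character whose general category starts with `M` as a mark, which is a simplification.

```python
import unicodedata

# "a" + COMBINING DIAERESIS (U+0308) + COMBINING ACUTE ACCENT (U+0301)
stacked = "a\u0308\u0301"

def combining_marks(text: str) -> list[str]:
    """Return the names of characters whose general category is a mark (Mn, Mc, Me)."""
    return [
        unicodedata.name(ch)
        for ch in text
        if unicodedata.category(ch).startswith("M")
    ]

print(len(stacked))              # 3 code points, displayed as one character
print(combining_marks(stacked))  # ['COMBINING DIAERESIS', 'COMBINING ACUTE ACCENT']
```

The string length counts code points, so code that assumes one code point per visible character will miscount text like this.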
How Normalization Handles Them
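- NFC: Decomposes the text, then recombines each base + combining mark sequence into a precomposed character where one exists (a small set of precomposed characters is excluded from recomposition)
- NFD: Separates precomposed characters into base + combining marks
- Canonical ordering: When several combining marks follow a base character, both NFD and NFC reorder marks that have different non-zero Canonical Combining Class (CCC) values into ascending CCC order. Marks with the same CCC keep their original relative order, because swapping them could change how the text renders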
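The three behaviours above can be sketched with `unicodedata.normalize` from Python's standard library. The helper names (`nfc`, `nfd`, `code_points`) and the sample strings are illustrative assumptions, not part of the original text.

```python
import unicodedata

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)

def code_points(text: str) -> list[str]:
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]

# NFC composes, NFD decomposes.
print(code_points(nfc("e\u0301")))  # ['U+00E9']
print(code_points(nfd("\u00e9")))   # ['U+0065', 'U+0301']

# Canonical ordering: acute (CCC 230) typed before dot below (CCC 220).
unordered = "a\u0301\u0323"
print(code_points(nfd(unordered)))  # ['U+0061', 'U+0323', 'U+0301']

# Marks with the same CCC (both 230 here) are left in their original order.
same_class = "a\u0301\u0308"
print(nfd(same_class) == same_class)  # True
```

Because of canonical ordering, `a` + acute + dot below and `a` + dot below + acute normalize to the same string, whereas acute + diaeresis and diaeresis + acute remain distinct.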
Why This Matters
If two strings represent the same text, one with a precomposed character and the other with a base + combining mark sequence, they will not compare equal byte by byte (or code point by code point) unless you first normalize both to the same form. This is one of the most common reasons to use normalization.
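A small Python sketch of this comparison problem follows. The word "café" and the `nfc_equal` helper are illustrative choices; NFC is used here, but normalizing both sides to NFD would work equally well for equality checks.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # False
print(len(precomposed.encode("utf-8")))     # 5
print(len(decomposed.encode("utf-8")))      # 6
print(nfc_equal(precomposed, decomposed))   # True
```

In practice it is usually simpler to normalize text once, when it enters the system, than to normalize at every comparison.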
Use Case
Combining characters are essential knowledge for developers working with multilingual text, especially in languages that use diacritics (French, German, Spanish, Vietnamese). Understanding them helps prevent bugs in text search, sorting, and display across different platforms and browsers.