Latin Accented Characters — Diacritics in Unicode
Learn about Latin accented characters in Unicode — pre-composed vs. combining forms, their 2-byte UTF-8 encoding, normalization (NFC/NFD), and common encoding pitfalls.
Detailed Explanation
Latin Accented Characters in Unicode
Accented Latin characters (e.g. é, à, ü, ñ, ç) are among the most common non-ASCII characters in Western languages. Unicode provides two ways to represent them, which creates both flexibility and complexity.
Pre-composed vs. Combining Forms
Unicode offers two representations for the same visual character:
Pre-composed (NFC — Canonical Decomposition followed by Canonical Composition):
- é = U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — single code point, 2 UTF-8 bytes
Combining (NFD — Canonical Decomposition):
- é = U+0065 + U+0301 (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT) — 2 code points, 3 UTF-8 bytes
Both render identically, but they differ in byte representation, string length, and comparison behavior.
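The differences in length and byte count can be checked directly. The sketch below is only an illustration: `describe` is a hypothetical helper name, and it assumes a runtime that provides the standard `TextEncoder` global (modern browsers and Node.js).

```javascript
// Two spellings of the same visible character "é".
const precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
const combining = "e\u0301";  // "e" followed by COMBINING ACUTE ACCENT

const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: summarize how a string is stored.
function describe(text) {
  return {
    // Array.from iterates a string by code point, not by UTF-16 code unit.
    codePoints: Array.from(text, (ch) =>
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")),
    length: text.length, // number of UTF-16 code units
    utf8Bytes: encoder.encode(text).length,
  };
}

console.log(describe(precomposed)); // codePoints ["U+00E9"], length 1, utf8Bytes 2
console.log(describe(combining));   // codePoints ["U+0065", "U+0301"], length 2, utf8Bytes 3
console.log(precomposed === combining); // false
```

Note that `length` counts UTF-16 code units, so it matches the number of code points here only because every character involved lies in the Basic Multilingual Plane.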
Common Accented Characters
| Character | Code Point | UTF-8 Bytes | Name |
|---|---|---|---|
| é | U+00E9 | C3 A9 | LATIN SMALL LETTER E WITH ACUTE |
| è | U+00E8 | C3 A8 | LATIN SMALL LETTER E WITH GRAVE |
| à | U+00E0 | C3 A0 | LATIN SMALL LETTER A WITH GRAVE |
| ü | U+00FC | C3 BC | LATIN SMALL LETTER U WITH DIAERESIS |
| ñ | U+00F1 | C3 B1 | LATIN SMALL LETTER N WITH TILDE |
| ç | U+00E7 | C3 A7 | LATIN SMALL LETTER C WITH CEDILLA |
| ö | U+00F6 | C3 B6 | LATIN SMALL LETTER O WITH DIAERESIS |
| å | U+00E5 | C3 A5 | LATIN SMALL LETTER A WITH RING ABOVE |
| ß | U+00DF | C3 9F | LATIN SMALL LETTER SHARP S |
The Latin-1 Supplement Block
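Pre-composed accented letters for the major Western European languages sit in the upper part of the Latin-1 Supplement block (U+0080–U+00FF), at U+00C0–U+00FF. Every code point in the block takes exactly 2 bytes in UTF-8, which is one byte fewer than a decomposed pair made of an ASCII base letter (1 byte) and a combining mark (2 bytes).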
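To see where the two bytes come from, the following sketch applies the 2-byte UTF-8 layout (`110xxxxx 10xxxxxx`) by hand. `encodeTwoByteUtf8` and `toHex` are hypothetical names used for illustration; in real code, `TextEncoder` does this work.

```javascript
// Hypothetical helper: encode a code point in the 2-byte UTF-8 range by hand.
// Layout: 110xxxxx 10xxxxxx, carrying 11 payload bits (U+0080 to U+07FF).
function encodeTwoByteUtf8(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError("code point is outside the 2-byte UTF-8 range");
  }
  const lead = 0xc0 | (codePoint >> 6);           // upper 5 payload bits
  const continuation = 0x80 | (codePoint & 0x3f); // lower 6 payload bits
  return [lead, continuation];
}

// Format a byte array as space-separated uppercase hex.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeTwoByteUtf8(0x00e9))); // "C3 A9" (é)
console.log(toHex(encodeTwoByteUtf8(0x00df))); // "C3 9F" (ß)

// Cross-check against TextEncoder for every code point in the block.
const utf8 = new TextEncoder();
for (let cp = 0x80; cp <= 0xff; cp++) {
  const expected = Array.from(utf8.encode(String.fromCodePoint(cp)));
  const actual = encodeTwoByteUtf8(cp);
  console.assert(expected.join() === actual.join(), "mismatch at", cp);
}
```

Code points U+00C0–U+00FF all share the lead byte C3, which is why every row in the table above starts with it.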
Normalization Matters
String comparison and search must account for normalization:
"caf\u00E9" !== "cafe\u0301" // true: the code point sequences differ
"caf\u00E9".normalize("NFC") === "cafe\u0301".normalize("NFC") // true
The Unicode Inspector shows whether your text uses pre-composed or combining forms, helping you diagnose comparison and search failures.
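In application code, the same idea can be wrapped in small helpers that normalize both sides before comparing or searching. This is a minimal sketch: the function names are hypothetical, and NFC is chosen here only as an example target form.

```javascript
// Hypothetical helpers: normalize both operands to NFC before comparing.
function equalsNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

function includesNormalized(haystack, needle) {
  return haystack.normalize("NFC").includes(needle.normalize("NFC"));
}

// Report which normalization form(s) a string is already in.
function normalizationForms(text) {
  return {
    isNFC: text === text.normalize("NFC"),
    isNFD: text === text.normalize("NFD"),
  };
}

const composed = "caf\u00E9";    // pre-composed é
const decomposed = "cafe\u0301"; // e + combining acute accent

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true
console.log(includesNormalized("un cafe\u0301 noir", "caf\u00E9")); // true
console.log(normalizationForms(composed));   // { isNFC: true, isNFD: false }
console.log(normalizationForms(decomposed)); // { isNFC: false, isNFD: true }
```

Normalizing once when text enters the system (for example, before it is stored) is usually cheaper than normalizing on every comparison, but which form to store is a design decision this sketch does not settle.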
Mojibake
When UTF-8 encoded accented text is read as Latin-1 (ISO-8859-1), you get mojibake: the two bytes of é (C3 A9) are each decoded as a separate Latin-1 character, producing Ã©. The Unicode Inspector reveals the actual code points in such garbled text, making it easier to identify the encoding mismatch.
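The mismatch can be reproduced, and in simple cases reversed, in a few lines. The sketch below assumes the text was decoded as true ISO-8859-1, where each byte maps to the code point with the same value; `repairLatin1Mojibake` is a hypothetical helper name. Text that was decoded as Windows-1252 instead needs a different reverse mapping for bytes 80–9F.

```javascript
const utf8Bytes = new TextEncoder().encode("caf\u00E9"); // 63 61 66 C3 A9

// ISO-8859-1 maps each byte to the code point with the same value,
// so String.fromCharCode reproduces a Latin-1 decode of those bytes.
const garbled = String.fromCharCode(...utf8Bytes);
console.log(garbled); // "cafÃ©"

// Hypothetical repair helper: map the characters back to bytes, then decode as UTF-8.
// It only applies when every character is in the range U+0000 to U+00FF.
function repairLatin1Mojibake(text) {
  const bytes = Uint8Array.from(text, (ch) => {
    const value = ch.charCodeAt(0);
    if (value > 0xff) throw new RangeError("not a Latin-1 decode of UTF-8 bytes");
    return value;
  });
  // fatal: true makes invalid UTF-8 throw instead of yielding U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log(repairLatin1Mojibake(garbled)); // "café"
```

The strict decoder is a deliberate choice: a string that was never mojibake (such as a lone é) fails loudly instead of being silently corrupted by the repair attempt.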
Use Case
Use this when debugging encoding issues with accented text in multilingual applications, understanding why string comparisons fail for text with diacritics, diagnosing mojibake in data imported from systems with different character encodings, or choosing between NFC and NFD normalization for your database.