Latin Accented Characters — Diacritics in Unicode

Learn about Latin accented characters in Unicode — pre-composed vs. combining forms, their 2-byte UTF-8 encoding, normalization (NFC/NFD), and common encoding pitfalls.

Detailed Explanation

Latin Accented Characters in Unicode

Accented Latin characters (e.g. é, à, ü, ñ, ç) are among the most common non-ASCII characters in Western languages. Unicode provides two ways to represent them, which creates both flexibility and complexity.

Pre-composed vs. Combining Forms

Unicode offers two representations for the same visual character:

Pre-composed (NFC — Canonical Decomposition followed by Canonical Composition):

  • é = U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — single code point, 2 UTF-8 bytes

Combining (NFD — Canonical Decomposition):

  • é = U+0065 + U+0301 (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT) — 2 code points, 3 UTF-8 bytes

Both render identically, but they differ in byte representation, string length, and comparison behavior.
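
The differences listed above can be checked directly. The following is a minimal sketch, assuming a runtime that provides the standard `TextEncoder` global (modern browsers and Node.js do); the helper names `codePointCount` and `utf8ByteCount` are illustrative and do not come from the original text.

```javascript
// Two spellings of the same visible character "é".
const precomposed = "\u00E9";  // U+00E9
const decomposed = "e\u0301";  // U+0065 + U+0301

const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Number of code points (the string iterator walks code points, not UTF-16 units).
const codePointCount = (s) => [...s].length;

// Number of bytes once the string is encoded as UTF-8.
const utf8ByteCount = (s) => encoder.encode(s).length;

console.log(codePointCount(precomposed), utf8ByteCount(precomposed)); // 1 2
console.log(codePointCount(decomposed), utf8ByteCount(decomposed));   // 2 3
console.log(precomposed === decomposed);                              // false
```

The last line is the comparison pitfall in its smallest form: the two strings look the same on screen but are not equal.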

Common Accented Characters

Character   Code Point   UTF-8 Bytes (hex)   Name
é           U+00E9       C3 A9               LATIN SMALL LETTER E WITH ACUTE
è           U+00E8       C3 A8               LATIN SMALL LETTER E WITH GRAVE
à           U+00E0       C3 A0               LATIN SMALL LETTER A WITH GRAVE
ü           U+00FC       C3 BC               LATIN SMALL LETTER U WITH DIAERESIS
ñ           U+00F1       C3 B1               LATIN SMALL LETTER N WITH TILDE
ç           U+00E7       C3 A7               LATIN SMALL LETTER C WITH CEDILLA
ö           U+00F6       C3 B6               LATIN SMALL LETTER O WITH DIAERESIS
å           U+00E5       C3 A5               LATIN SMALL LETTER A WITH RING ABOVE
ß           U+00DF       C3 9F               LATIN SMALL LETTER SHARP S

The Latin-1 Supplement Block

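Most pre-composed accented letters used in Western European languages sit in the Latin-1 Supplement block (U+0080–U+00FF), specifically in the range U+00C0–U+00FF; a few, such as Ÿ (U+0178), live in Latin Extended-A instead. Every code point from U+0080 to U+07FF takes exactly 2 bytes in UTF-8, so a pre-composed accented letter is one byte shorter than its decomposed equivalent (a 1-byte base letter plus a 2-byte combining mark).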
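
To see where the 2 bytes come from, the sketch below encodes a code point by hand using the UTF-8 two-byte layout `110xxxxx 10xxxxxx`. The function names `encodeTwoByteUtf8` and `toHex` are hypothetical helpers written for this illustration; in real code the built-in `TextEncoder` does this work.

```javascript
// Encode a code point in the 2-byte UTF-8 range (U+0080..U+07FF) by hand.
// Layout: 110xxxxx 10xxxxxx, where the x bits are the 11 low bits of the code point.
function encodeTwoByteUtf8(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError("code point is outside the 2-byte UTF-8 range");
  }
  const lead = 0xc0 | (codePoint >> 6);           // 110 prefix + top 5 bits
  const continuation = 0x80 | (codePoint & 0x3f); // 10 prefix + low 6 bits
  return [lead, continuation];
}

// Format a byte array as upper-case hex pairs, as in the table above.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeTwoByteUtf8(0x00e9))); // C3 A9  (é)
console.log(toHex(encodeTwoByteUtf8(0x00df))); // C3 9F  (ß)
```

Because every code point in U+0080–U+00FF has its top five bits equal to 00010 or 00011, the lead byte for this block is always C2 or C3, which is why every row of the table starts with C3.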

Normalization Matters

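String comparison and search must account for normalization, because the two forms are different code point sequences even though they look the same:

"caf\u00E9" !== "cafe\u0301"  // true: the code point sequences differ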
"caf\u00E9".normalize("NFC") === "cafe\u0301".normalize("NFC")  // true

The Unicode Inspector shows whether your text uses pre-composed or combining forms, helping you diagnose comparison and search failures.
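
The same checks can be approximated in a few lines of code. This is a minimal sketch and not the Unicode Inspector's implementation: `equalsNormalized`, `includesNormalized`, and `inspectCodePoints` are hypothetical helpers, and the example assumes a runtime with Unicode property escapes (`\p{M}`) in regular expressions.

```javascript
// Compare two strings after bringing both to the same normalization form.
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

// Substring search that ignores the pre-composed vs. combining distinction.
function includesNormalized(haystack, needle, form = "NFC") {
  return haystack.normalize(form).includes(needle.normalize(form));
}

// List each code point and flag combining marks (Unicode general category M).
function inspectCodePoints(text) {
  return [...text].map((ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    isCombiningMark: /\p{M}/u.test(ch),
  }));
}

console.log(equalsNormalized("caf\u00E9", "cafe\u0301"));            // true
console.log("un cafe\u0301 noir".includes("caf\u00E9"));             // false
console.log(includesNormalized("un cafe\u0301 noir", "caf\u00E9"));  // true
console.log(inspectCodePoints("e\u0301").map((c) => c.codePoint).join(" ")); // U+0065 U+0301
```

Normalizing both sides on every call is simple but repeats work; a common alternative is to normalize text once, when it enters the system, and compare the stored form directly.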

Mojibake

When UTF-8 encoded accented text is read as Latin-1 (ISO-8859-1), you get mojibake: é (2 bytes: C3 A9) is misinterpreted as Ã© (two Latin-1 characters, Ã for C3 and © for A9). The Unicode Inspector reveals the actual code points in such garbled text, making it easier to identify the encoding mismatch.
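
The mismatch is easy to reproduce and, in the simple case, to reverse. The sketch below assumes strict ISO-8859-1, where each byte maps to the code point with the same value; text that was decoded as Windows-1252 would need a different byte mapping. `decodeLatin1` and `repairLatin1Mojibake` are hypothetical names used only for this illustration.

```javascript
// UTF-8 bytes for "é": C3 A9.
const utf8Bytes = new TextEncoder().encode("\u00E9");

// ISO-8859-1 maps every byte to the code point with the same value,
// so decoding as Latin-1 is a byte-for-byte String.fromCharCode.
const decodeLatin1 = (bytes) => String.fromCharCode(...bytes);

const garbled = decodeLatin1(utf8Bytes);
console.log(garbled); // Ã©

// Repair: turn each character back into its byte, then decode as UTF-8.
function repairLatin1Mojibake(text) {
  const bytes = Uint8Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    if (code > 0xff) throw new RangeError("not a Latin-1 character: " + ch);
    return code;
  });
  // fatal: true makes the decoder throw instead of silently inserting U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log(repairLatin1Mojibake(garbled)); // é
```

The repair only works when no bytes were lost or replaced along the way, so it is better treated as a diagnostic aid than as a general fix.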

Use Case

Use this when debugging encoding issues with accented text in multilingual applications, understanding why string comparisons fail for text with diacritics, diagnosing mojibake in data imported from systems with different character encodings, or choosing between NFC and NFD normalization for your database.
