Latin Accented Characters — Diacritics in Unicode


Learn about Latin accented characters in Unicode — pre-composed vs. combining forms, their 2-byte UTF-8 encoding, normalization (NFC/NFD), and common encoding pitfalls.


Detailed Explanation

Latin Accented Characters in Unicode

Accented Latin characters (e.g. é, à, ü, ñ, ç) are among the most common non-ASCII characters in Western languages. Unicode provides two ways to represent them, which creates both flexibility and complexity.

Pre-composed vs. Combining Forms

Unicode offers two representations for the same visual character:

Pre-composed (NFC — Canonical Decomposition followed by Canonical Composition):

é = U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — single code point, 2 UTF-8 bytes

Combining (NFD — Canonical Decomposition):

é = U+0065 + U+0301 (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT) — 2 code points, 3 UTF-8 bytes

Both render identically, but they differ in byte representation, string length, and comparison behavior.
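The differences listed above can be checked directly. The following is a minimal sketch, assuming a runtime that provides the standard TextEncoder global (modern browsers and Node.js); the helper name describe is made up for this illustration.

```javascript
// Two spellings of the same visible "é".
const precomposed = "\u00E9"; // U+00E9
const decomposed = "e\u0301"; // U+0065 + U+0301

const utf8 = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: summarize how a string is stored.
function describe(s) {
  return {
    // Spread iterates by code point, not by UTF-16 code unit.
    codePoints: [...s].map(
      (c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
    utf16Length: s.length, // what String.prototype.length reports
    utf8Bytes: utf8.encode(s).length,
  };
}

const a = describe(precomposed); // 1 code point, length 1, 2 UTF-8 bytes
const b = describe(decomposed); // 2 code points, length 2, 3 UTF-8 bytes

console.log(a, b);
console.log(precomposed === decomposed); // false: the code units differ
```

Note that length in JavaScript counts UTF-16 code units, so it matches the code point count here only because every character involved is in the Basic Multilingual Plane.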

Common Accented Characters

Character	Code Point	UTF-8 Bytes	Name
é	U+00E9	C3 A9	LATIN SMALL LETTER E WITH ACUTE
è	U+00E8	C3 A8	LATIN SMALL LETTER E WITH GRAVE
à	U+00E0	C3 A0	LATIN SMALL LETTER A WITH GRAVE
ü	U+00FC	C3 BC	LATIN SMALL LETTER U WITH DIAERESIS
ñ	U+00F1	C3 B1	LATIN SMALL LETTER N WITH TILDE
ç	U+00E7	C3 A7	LATIN SMALL LETTER C WITH CEDILLA
ö	U+00F6	C3 B6	LATIN SMALL LETTER O WITH DIAERESIS
å	U+00E5	C3 A5	LATIN SMALL LETTER A WITH RING ABOVE
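ß	U+00DF	C3 9F	LATIN SMALL LETTER SHARP S

Note: ß is not an accented letter and has no canonical decomposition. It is listed here because it lives in the same block and is also a 2-byte UTF-8 character.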

The Latin-1 Supplement Block

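Pre-composed accented characters for Western European languages occupy the Latin-1 Supplement block (U+0080–U+00FF). Every code point in this block takes exactly 2 bytes in UTF-8, so a pre-composed letter (2 bytes) is more compact than its decomposed equivalent (a 1-byte base letter plus a 2-byte combining mark, 3 bytes in total).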
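The 2-byte claim is easy to verify by brute force. This sketch, which assumes the standard TextEncoder global, encodes every code point in the block and also collects the ones that NFD actually splits apart; the variable names are illustrative only.

```javascript
const encoder = new TextEncoder();

const byteLengths = new Set(); // distinct UTF-8 lengths seen in the block
const decomposable = []; // characters that NFD rewrites into base + mark

for (let cp = 0x80; cp <= 0xff; cp++) {
  const ch = String.fromCodePoint(cp);
  byteLengths.add(encoder.encode(ch).length);
  if (ch.normalize("NFD") !== ch) {
    decomposable.push(ch);
  }
}

console.log([...byteLengths]); // [ 2 ]
console.log(decomposable.includes("\u00E9")); // true: é decomposes
console.log(decomposable.includes("\u00DF")); // false: ß has no decomposition
```

Not every letter in the block decomposes: characters such as ß, æ, and ø are atomic, so NFC and NFD leave them unchanged.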

Normalization Matters

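String comparison and search must account for normalization, because the two forms are different sequences of code points even though they look the same:

"caf\u00E9" !== "cafe\u0301"  // true: 4 code units vs. 5, so the strings are not equal
"caf\u00E9".normalize("NFC") === "cafe\u0301".normalize("NFC")  // true: both become "caf\u00E9"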

The Unicode Inspector shows whether your text uses pre-composed or combining forms, helping you diagnose comparison and search failures.
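One way to avoid these failures is to normalize both sides before comparing or searching. The helpers below are a sketch with hypothetical names (sameText, includesNormalized), and they assume NFC as the common form; NFD would work equally well as long as both sides use the same one.

```javascript
// Hypothetical helper: equality that ignores composed/decomposed differences.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Hypothetical helper: substring search on normalized text.
function includesNormalized(haystack, needle) {
  return haystack.normalize("NFC").includes(needle.normalize("NFC"));
}

const composed = "caf\u00E9"; // é as one code point
const combining = "cafe\u0301"; // e followed by a combining acute accent

console.log(composed === combining); // false
console.log(sameText(composed, combining)); // true
console.log(includesNormalized("un " + combining + " noir", composed)); // true
console.log(sameText("cafe", composed)); // false: the accent still matters
```

Normalizing once at the point where text enters the system (for example, before it is stored) is usually cheaper than normalizing on every comparison.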

Mojibake

When UTF-8 encoded accented text is read as Latin-1/ISO-8859-1, you get mojibake: é (2 bytes: C3 A9) is misinterpreted as Ã© (two Latin-1 characters). The Unicode Inspector reveals the actual code points in such garbled text, making it easier to identify the encoding mismatch.
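The mismatch can be reproduced and, in the simple case, reversed. This sketch assumes the standard TextEncoder and TextDecoder globals; decodeAsLatin1 and repairMojibake are hypothetical helper names. Latin-1 decoding is done by hand, since ISO-8859-1 maps each byte directly to the code point with the same value.

```javascript
// Hypothetical helper: interpret raw bytes as ISO-8859-1 (byte N -> U+00NN).
function decodeAsLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Hypothetical helper: undo one round of UTF-8-read-as-Latin-1 damage.
function repairMojibake(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => {
    const code = ch.charCodeAt(0);
    if (code > 0xff) {
      throw new RangeError("not Latin-1 mojibake: found a character above U+00FF");
    }
    return code;
  });
  // fatal: true makes the decoder throw instead of inserting U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

const original = "caf\u00E9";
const utf8Bytes = new TextEncoder().encode(original); // 63 61 66 C3 A9
const garbled = decodeAsLatin1(utf8Bytes);

console.log(garbled); // "cafÃ©"
console.log(repairMojibake(garbled) === original); // true
```

The repair only works when the garbled text went through exactly one wrong decode and no bytes were lost along the way. If the text was actually decoded as Windows-1252, bytes in the 0x80–0x9F range map to other characters and this helper would reject them.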

Use Case

Use this when debugging encoding issues with accented text in multilingual applications, understanding why string comparisons fail for text with diacritics, diagnosing mojibake in data imported from systems with different character encodings, or choosing between NFC and NFD normalization for your database.
