Latin Extended Characters and Multi-Byte UTF-8
Learn how accented Latin characters like e with acute, u with umlaut, and n with tilde affect string length in UTF-8 and other encodings.
Detailed Explanation
Beyond ASCII: Latin Extended Characters
Characters like é, ü, ñ, ç, and å are common in European languages. While they look like single characters, their encoding details reveal important differences from plain ASCII.
Example String
café naïve résumé
Length Measurements
| Metric | Value |
|---|---|
| JavaScript .length | 17 |
| Code points | 17 |
| Grapheme clusters | 17 |
| UTF-8 bytes | 21 |
| UTF-16 bytes | 34 |
| UTF-32 bytes | 68 |
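The figures in the table can be reproduced with a few standard APIs. This is a minimal sketch: the function name `measure` and its property names are illustrative, and it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (current browsers and recent Node.js versions). The byte counts exclude any byte order mark.

```javascript
// Measure one string several ways.
function measure(str) {
  // The string iterator walks code points, not UTF-16 code units.
  const codePoints = [...str].length;
  // Intl.Segmenter groups code points into user-perceived characters.
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const graphemes = [...segmenter.segment(str)].length;
  return {
    utf16Units: str.length, // what JavaScript .length reports
    codePoints,
    graphemes,
    utf8Bytes: new TextEncoder().encode(str).length,
    utf16Bytes: str.length * 2, // 2 bytes per UTF-16 code unit
    utf32Bytes: codePoints * 4, // 4 bytes per code point
  };
}

// normalize("NFC") guarantees the precomposed form, however the
// source file happened to store the accents.
const result = measure("café naïve résumé".normalize("NFC"));
console.log(result);
```

For this string, `.length`, code points, and grapheme clusters all agree at 17 because every character is a single precomposed code point in the Basic Multilingual Plane.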
Why UTF-8 Bytes Differ
Code points from U+0080 to U+07FF require 2 bytes in UTF-8 instead of 1. That range covers the Latin-1 Supplement (U+0080 to U+00FF) as well as Latin Extended-A and Latin Extended-B. The string "café naïve résumé" has 4 accented characters (é, ï, é, é), each costing 2 UTF-8 bytes, while the remaining 13 ASCII characters (11 letters and 2 spaces) cost 1 byte each. Total: 13 + (4 × 2) = 21 bytes.
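The same total can be derived without an encoder by applying UTF-8's size boundaries to each code point. The sketch below is illustrative only, and the helper names `utf8BytesForCodePoint` and `utf8ByteLength` are hypothetical.

```javascript
// UTF-8 size of a single code point, from the encoding's range boundaries.
function utf8BytesForCodePoint(cp) {
  if (cp <= 0x7f) return 1; // ASCII
  if (cp <= 0x7ff) return 2; // includes Latin-1 Supplement and Latin Extended-A/B
  if (cp <= 0xffff) return 3; // rest of the Basic Multilingual Plane
  return 4; // supplementary planes
}

// Sum the per-code-point sizes; for...of iterates by code point.
function utf8ByteLength(str) {
  let total = 0;
  for (const ch of str) total += utf8BytesForCodePoint(ch.codePointAt(0));
  return total;
}

const sample = "café naïve résumé".normalize("NFC");
const twoByteChars = [...sample].filter(
  (ch) => utf8BytesForCodePoint(ch.codePointAt(0)) === 2
);
console.log(twoByteChars.join(" ")); // é ï é é
console.log(utf8ByteLength(sample)); // 21
```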
Precomposed vs Decomposed Forms
The character é can be stored two ways:
- Precomposed (NFC): U+00E9 (a single code point, é)
- Decomposed (NFD): U+0065 + U+0301 (two code points, e + combining acute accent)
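Both normally render identically, but their .length and byte counts differ. Precomposed é is 1 code point and 2 UTF-8 bytes; decomposed é is 2 code points and 3 UTF-8 bytes, because the combining accent U+0301 takes 2 bytes on its own. This is why Unicode normalization matters before comparing or measuring strings.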
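A short sketch of the difference, using the built-in `String.prototype.normalize`. The variable names and the `utf8Length` helper are illustrative; escape sequences are used so the form of each string does not depend on how this text was saved.

```javascript
const nfc = "\u00E9"; // é as one precomposed code point
const nfd = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT
const utf8Length = (s) => new TextEncoder().encode(s).length;

console.log(nfc.length, utf8Length(nfc)); // 1 2
console.log(nfd.length, utf8Length(nfd)); // 2 3

// A plain comparison sees two different strings...
console.log(nfc === nfd); // false
// ...until both sides are normalized to the same form.
console.log(nfc.normalize("NFC") === nfd.normalize("NFC")); // true

// Decomposing the example string adds one code unit per accent.
const decomposed = "café naïve résumé".normalize("NFD");
console.log(decomposed.length, utf8Length(decomposed)); // 21 25
```

Normalizing to one form (NFC is a common choice) before measuring or comparing keeps the results consistent.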
Database Implications
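If your database column is VARCHAR(255) measured in characters, both forms of a short string like this one fit, although the decomposed form typically counts as more characters because each combining mark is its own code point. If the limit is measured in bytes (as in older MySQL configurations), the decomposed form also uses more storage and reaches the limit sooner. Always check whether your DB counts characters or bytes.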
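An application-side pre-check might look like the sketch below. The helper `fitsColumn` and its `unit` labels are hypothetical, and the code assumes that a character-counted column counts code points and that a byte-counted column stores UTF-8. Real databases apply their own rules (character set, collation, row-size limits), so this is only an estimate.

```javascript
// Hypothetical helper: does `value` fit a column limited to `limit` units?
// unit "chars" counts code points; unit "bytes" counts UTF-8 bytes.
function fitsColumn(value, limit, unit) {
  const size =
    unit === "bytes"
      ? new TextEncoder().encode(value).length
      : [...value].length;
  return size <= limit;
}

const precomposed = "\u00E9".repeat(100); // 100 code points, 200 UTF-8 bytes
const decomposedForm = precomposed.normalize("NFD"); // 200 code points, 300 UTF-8 bytes

console.log(fitsColumn(precomposed, 255, "chars")); // true
console.log(fitsColumn(decomposedForm, 255, "chars")); // true
console.log(fitsColumn(precomposed, 255, "bytes")); // true
console.log(fitsColumn(decomposedForm, 255, "bytes")); // false
```

Normalizing input to NFC before storing it keeps both counts as low as the text allows.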
Use Case
When building applications for European markets (French, German, Spanish, Portuguese), understanding that precomposed accented characters use 2 bytes each in UTF-8, and more in decomposed form, is essential for accurate storage estimation and VARCHAR limit calculations.