Latin Extended Characters and Multi-Byte UTF-8

Learn how accented Latin characters like é (e with acute), ü (u with umlaut), and ñ (n with tilde) affect string length in UTF-8 and other encodings.

Detailed Explanation

Beyond ASCII: Latin Extended Characters

Characters like é, ü, ñ, ç, and å are common in European languages. While they look like single characters, their encoding details reveal important differences from plain ASCII.

Example String

café naïve résumé

Length Measurements

Metric                Value
JavaScript .length    17
Code points           17
Grapheme clusters     17
UTF-8 bytes           21
UTF-16 bytes          34
UTF-32 bytes          68
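
The measurements above can be reproduced with a short JavaScript sketch. The function name measure is hypothetical, and the sketch assumes a runtime that provides TextEncoder and Intl.Segmenter (recent browsers and Node.js versions do). The string uses escape sequences so that the precomposed form is used no matter how the file is saved.

```javascript
// Escapes guarantee the precomposed (NFC) form of each accented letter.
const sample = "caf\u00e9 na\u00efve r\u00e9sum\u00e9"; // "café naïve résumé"

// Hypothetical helper: returns the six length metrics for a string.
function measure(str) {
  const codeUnits = str.length;        // UTF-16 code units (JavaScript .length)
  const codePoints = [...str].length;  // the string iterator walks code points
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const graphemes = [...segmenter.segment(str)].length;
  const utf8Bytes = new TextEncoder().encode(str).length;
  return {
    codeUnits,
    codePoints,
    graphemes,
    utf8Bytes,
    utf16Bytes: codeUnits * 2,   // 2 bytes per code unit, no byte order mark
    utf32Bytes: codePoints * 4,  // 4 bytes per code point, no byte order mark
  };
}

console.log(measure(sample));
// { codeUnits: 17, codePoints: 17, graphemes: 17,
//   utf8Bytes: 21, utf16Bytes: 34, utf32Bytes: 68 }
```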

Why UTF-8 Bytes Differ

Characters in the Latin-1 Supplement range (U+0080 to U+00FF) require 2 bytes in UTF-8 instead of 1. The string "café naïve résumé" has 4 accented characters (é, ï, é, é), each costing 2 UTF-8 bytes, while the remaining 13 ASCII characters cost 1 byte each. Total: 13 + (4 × 2) = 21 bytes.
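
As a rough illustration of that arithmetic, the sketch below counts UTF-8 bytes from code point ranges instead of calling an encoder. The function names utf8BytesForCodePoint and utf8ByteLength are made up for this example. The thresholds follow the standard UTF-8 ranges.

```javascript
// Bytes needed to encode one Unicode code point in UTF-8.
function utf8BytesForCodePoint(cp) {
  if (cp <= 0x7f) return 1;    // ASCII
  if (cp <= 0x7ff) return 2;   // includes Latin-1 Supplement (U+0080 to U+00FF)
  if (cp <= 0xffff) return 3;  // rest of the Basic Multilingual Plane
  return 4;                    // supplementary planes
}

// Sum the per-code-point cost over the whole string.
function utf8ByteLength(str) {
  let total = 0;
  for (const ch of str) {
    total += utf8BytesForCodePoint(ch.codePointAt(0));
  }
  return total;
}

const text = "caf\u00e9 na\u00efve r\u00e9sum\u00e9"; // "café naïve résumé"
console.log(utf8BytesForCodePoint(0x65)); // 1 (plain e)
console.log(utf8BytesForCodePoint(0xe9)); // 2 (é)
console.log(utf8ByteLength(text));        // 21 = 13 + (4 × 2)
```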

Precomposed vs Decomposed Forms

The character é can be stored two ways:

  • Precomposed (NFC): U+00E9 (a single code point, é)
  • Decomposed (NFD): U+0065 + U+0301 (two code points, e + combining acute accent)

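Both render identically, but their .length and byte counts differ: the precomposed form is 1 code point and 2 UTF-8 bytes, while the decomposed form is 2 code points and 3 UTF-8 bytes. Two strings that look the same can therefore compare as unequal and measure differently, which is why Unicode normalization matters before comparing or measuring strings.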
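
A minimal sketch of the two forms in JavaScript, using the built-in String.prototype.normalize. The variable names and the utf8Len helper are illustrative only.

```javascript
const precomposed = "\u00e9"; // é as one code point (NFC)
const decomposed = "e\u0301"; // e + combining acute accent (NFD)

// Illustrative helper: UTF-8 byte length of a string.
const utf8Len = (s) => new TextEncoder().encode(s).length;

console.log(precomposed === decomposed);                // false
console.log(precomposed.length, decomposed.length);     // 1 2
console.log(utf8Len(precomposed), utf8Len(decomposed)); // 2 3

// After normalizing to the same form, the strings compare as equal.
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```

Normalizing both sides to the same form (NFC is a common choice) before comparing or measuring avoids these mismatches.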

Database Implications

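If your database column is VARCHAR(255) measured in characters, the limit usually counts code points, so a decomposed é takes two of the 255 positions where the precomposed form takes one. If the column is measured in bytes (as in older MySQL configurations), the decomposed é takes 3 UTF-8 bytes instead of 2. Always check whether your DB counts characters or bytes, and consider normalizing text to NFC before storing it.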
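
One way to guard against this on the application side is to normalize and measure a value before it reaches the database. The sketch below is only an illustration: fitsColumn is a hypothetical helper, and it assumes that a character-counted column counts code points and that a byte-counted column stores UTF-8. Real database engines may count differently, so check the documentation for yours.

```javascript
// Hypothetical helper: normalizes to NFC, then checks the value against a
// column limit expressed either in characters (code points) or UTF-8 bytes.
function fitsColumn(value, limit, unit = "chars") {
  const normalized = value.normalize("NFC");
  const size = unit === "bytes"
    ? new TextEncoder().encode(normalized).length
    : [...normalized].length;
  return { normalized, size, fits: size <= limit };
}

// 200 visible é characters stored in decomposed form: 400 code points.
const decomposedValue = "e\u0301".repeat(200);
console.log([...decomposedValue].length); // 400

const byChars = fitsColumn(decomposedValue, 255, "chars");
console.log(byChars.size, byChars.fits); // 200 true

const byBytes = fitsColumn(decomposedValue, 255, "bytes");
console.log(byBytes.size, byBytes.fits); // 400 false
```

Without the normalization step, the same value would measure 400 code points and 600 UTF-8 bytes, and would fail both checks.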

Use Case

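When building applications for European markets (French, German, Spanish, Portuguese), keep in mind that the accented characters in these languages typically use 2 bytes each in UTF-8 when precomposed, and 3 bytes when decomposed into a base letter plus a combining mark. Accounting for this is essential for accurate storage estimation and VARCHAR limit calculations.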
