UTF-8 vs UTF-16: Encoding Size Comparison

Compare UTF-8 and UTF-16 byte sizes for ASCII, Latin, CJK, and emoji text. Learn which encoding is more efficient for different types of content.

Detailed Explanation

UTF-8 vs UTF-16: Which Is Smaller?

The answer depends entirely on the content. UTF-8 and UTF-16 use different byte widths for different Unicode ranges, so the most efficient encoding varies by text language and composition.

Byte Widths by Character Range

Unicode Range        Characters                                         UTF-8     UTF-16
U+0000–U+007F        ASCII                                              1 byte    2 bytes
U+0080–U+07FF        Latin Extended, Greek, Cyrillic, Arabic, Hebrew    2 bytes   2 bytes
U+0800–U+FFFF        CJK, Devanagari, Thai, most symbols                3 bytes   2 bytes
U+10000–U+10FFFF     Emoji, rare CJK, historic scripts                  4 bytes   4 bytes
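
The table can be expressed as two small functions. This is a minimal sketch in Python; the names utf8_width and utf16_width are illustrative, and the loop cross-checks them against Python's built-in codecs for one sample character per range.

```python
def utf8_width(cp: int) -> int:
    """Bytes needed to encode one code point in UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def utf16_width(cp: int) -> int:
    """Bytes needed to encode one code point in UTF-16 (BOM not counted)."""
    # Code points above U+FFFF need a surrogate pair: two 2-byte code units.
    return 2 if cp < 0x10000 else 4


# One sample per row of the table: A, e-acute, a CJK ideograph, an emoji.
for ch in ("A", "\u00e9", "\u6771", "\U0001F600"):
    cp = ord(ch)
    # Cross-check the range logic against the real encoders.
    assert utf8_width(cp) == len(ch.encode("utf-8"))
    assert utf16_width(cp) == len(ch.encode("utf-16-le"))
    print(f"U+{cp:04X}  UTF-8: {utf8_width(cp)}  UTF-16: {utf16_width(cp)}")
```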

When UTF-8 Wins

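UTF-8 is more compact for text that is predominantly ASCII. Characters in U+0080–U+07FF (accented Latin letters, Greek, Cyrillic) take 2 bytes in both encodings, so the savings come from the ASCII portion: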

"Hello World" (11 chars)
  UTF-8:  11 bytes
  UTF-16: 22 bytes  (2× larger!)

For English, French, German, Spanish, and other Latin-based languages, UTF-8 is typically 30-50% smaller than UTF-16. The saving approaches 50% for plain English and shrinks as the share of accented letters grows.
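
The "Hello World" figures can be checked directly. The sketch below assumes Python 3.9 or later; encoded_sizes is a hypothetical helper name. It uses the "utf-16-le" codec because Python's plain "utf-16" codec prepends a 2-byte BOM, which would inflate the count.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, not counting any BOM."""
    # "utf-16-le" fixes the byte order, so no BOM is written.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


utf8_size, utf16_size = encoded_sizes("Hello World")
print(utf8_size, utf16_size)  # 11 22

# With the generic codec, the BOM adds 2 bytes.
print(len("Hello World".encode("utf-16")))  # 24
```

Whether to count the BOM depends on how the data is stored or transmitted; for long texts the 2 bytes are negligible.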

When UTF-16 Wins

UTF-16 is more compact for text that is predominantly CJK:

"東京都渋谷区" (6 chars)
  UTF-8:  18 bytes
  UTF-16: 12 bytes  (33% smaller!)

Break-Even Point

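For text that mixes ASCII with CJK, the break-even depends on the share of ASCII characters (counted by character, not by byte). Each ASCII character saves 1 byte in UTF-8 and each CJK character costs 1 extra byte, so the two encodings tie when the counts are equal. Roughly:

  • More than 50% ASCII → UTF-8 wins
  • Less than 50% ASCII → UTF-16 wins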
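
The trade-off can be explored numerically. This is a rough sketch that assumes a two-way mix of ASCII characters (1 byte in UTF-8) and CJK characters in U+0800–U+FFFF (3 bytes in UTF-8), both of which take 2 bytes in UTF-16. The function names are illustrative; smaller_encoding measures a concrete string rather than relying on the estimate.

```python
def average_bytes_per_char(ascii_fraction: float) -> tuple[float, float]:
    """Average bytes per character as (utf8, utf16) for an ASCII/CJK mix.

    ascii_fraction is the share of characters (not bytes) that are ASCII;
    the remainder is assumed to be CJK from the U+0800-U+FFFF range.
    """
    utf8 = 1 * ascii_fraction + 3 * (1 - ascii_fraction)
    utf16 = 2.0  # every character in this mix is one 2-byte code unit
    return utf8, utf16


def smaller_encoding(text: str) -> str:
    """Report which encoding is smaller for an actual string (BOM ignored)."""
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(text.encode("utf-16-le"))
    if utf8_size == utf16_size:
        return "tie"
    return "UTF-8" if utf8_size < utf16_size else "UTF-16"


for fraction in (0.25, 0.5, 0.75):
    utf8, utf16 = average_bytes_per_char(fraction)
    print(f"{fraction:.0%} ASCII: UTF-8 {utf8:.2f} B/char, UTF-16 {utf16:.2f} B/char")
```

Real text also contains 2-byte and 4-byte characters, so measuring a representative sample is more reliable than the two-way estimate.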

Real-World Examples

Content Type          Typical UTF-8 / UTF-16 Ratio
English prose         0.5× (UTF-8 half the size)
JSON API response     0.6× (mostly ASCII keys + values)
Japanese blog post    1.2× (UTF-8 slightly larger)
Chinese document      1.4× (UTF-8 40% larger)
Emoji-heavy chat      1.0× (roughly equal)
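
To get the ratio for your own content, measure it. The sketch below is one way to do that in Python; utf8_to_utf16_ratio is an illustrative name, and the sample strings are made up for demonstration, so their ratios will not match the typical figures above exactly.

```python
def utf8_to_utf16_ratio(text: str) -> float:
    """UTF-8 size divided by UTF-16 size (no BOM); below 1.0 means UTF-8 is smaller."""
    utf16_size = len(text.encode("utf-16-le"))
    if utf16_size == 0:
        raise ValueError("ratio is undefined for empty text")
    return len(text.encode("utf-8")) / utf16_size


# Hypothetical samples for illustration only; real documents will vary.
samples = {
    "English prose": "The quick brown fox jumps over the lazy dog.",
    "JSON": '{"city": "東京", "population": 13960000}',
    "Japanese": "吾輩は猫である。名前はまだ無い。",
}
for label, text in samples.items():
    print(f"{label}: {utf8_to_utf16_ratio(text):.2f}x")
```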

Why UTF-8 Dominates the Web

Despite UTF-16 being smaller for some content, UTF-8 dominates because:

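  1. ASCII compatibility: HTML tags, JSON syntax, and URLs stay at 1 byte per character
  2. No byte-order issues: UTF-16 needs a BOM (Byte Order Mark) or an explicitly declared endianness (UTF-16LE/UTF-16BE); UTF-8 has only one byte order
  3. Self-synchronizing: Continuation bytes are distinguishable from lead bytes, so you can find the nearest character boundary from any byte position
  4. Universal standard: The HTML Standard and JSON (RFC 8259) require UTF-8, and it is the default across most web tooling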
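
The self-synchronizing property in point 3 can be sketched in a few lines. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so backing up to a character start only requires skipping those bytes. The function name char_start is illustrative, and the sketch assumes the input is valid UTF-8.

```python
def char_start(data: bytes, pos: int) -> int:
    """Back up from pos to the first byte of the UTF-8 character containing it."""
    # Continuation bytes always match 0b10xxxxxx; lead bytes never do.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "a東b".encode("utf-8")  # bytes: 61 | E6 9D B1 | 62
start = char_start(data, 2)    # offset 2 falls in the middle of 東
print(start, data[start:start + 3].decode("utf-8"))  # 1 東
```

UTF-16 offers a similar check for surrogate pairs, but only if you already know you are aligned on a 2-byte code unit boundary.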

Use Case

When optimizing storage, network payload size, or choosing database encoding for multilingual applications, understanding the UTF-8 vs UTF-16 tradeoff helps make informed decisions that can significantly reduce data transfer and storage costs.
