UTF-8 vs UTF-16: Encoding Size Comparison
Compare UTF-8 and UTF-16 byte sizes for ASCII, Latin, CJK, and emoji text. Learn which encoding is more efficient for different types of content.
Detailed Explanation
UTF-8 vs UTF-16: Which Is Smaller?
The answer depends entirely on the content. UTF-8 and UTF-16 use different byte widths for different Unicode ranges, so the most efficient encoding varies by text language and composition.
Byte Widths by Character Range
| Unicode Range | Characters | UTF-8 | UTF-16 |
|---|---|---|---|
| U+0000–U+007F | ASCII | 1 byte | 2 bytes |
| U+0080–U+07FF | Latin Extended, Greek, Cyrillic, Arabic, Hebrew | 2 bytes | 2 bytes |
| U+0800–U+FFFF | CJK, Devanagari, Thai, most symbols | 3 bytes | 2 bytes |
| U+10000–U+10FFFF | Most emoji, rare CJK, historic scripts | 4 bytes | 4 bytes |
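The table can be expressed as two small range checks. The following is a minimal sketch; the helper names `utf8_len` and `utf16_len` are made up for illustration, and the loop cross-checks them against Python's built-in encoders.

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode code point `cp` in UTF-8."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def utf16_len(cp: int) -> int:
    """Bytes needed to encode code point `cp` in UTF-16 (no BOM)."""
    # Code points above U+FFFF need a surrogate pair (two 16-bit units).
    return 2 if cp <= 0xFFFF else 4


for ch in ["A", "é", "東", "😀"]:
    cp = ord(ch)
    # Cross-check the range logic against Python's own encoders.
    assert utf8_len(cp) == len(ch.encode("utf-8"))
    assert utf16_len(cp) == len(ch.encode("utf-16-le"))
    print(f"U+{cp:04X} {ch!r}: UTF-8 {utf8_len(cp)} B, UTF-16 {utf16_len(cp)} B")
```

The sketch uses the `utf-16-le` codec rather than `utf-16` because the latter prepends a 2-byte BOM to the output.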
When UTF-8 Wins
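UTF-8 is more compact for text that is predominantly ASCII. Characters in the 2-byte range (accented Latin letters, Greek, Cyrillic) cost the same in both encodings, so any ASCII in the mix tips the balance toward UTF-8: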
"Hello World" (11 chars)
UTF-8: 11 bytes
UTF-16: 22 bytes (2× larger!)
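For English, French, German, Spanish, and other Latin-based languages, UTF-8 is typically 30-50% smaller than UTF-16. The saving approaches 50% for plain English and shrinks slightly as the share of accented letters grows, since each of those takes 2 bytes in both encodings.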
When UTF-16 Wins
UTF-16 is more compact for text that is predominantly CJK:
"東京都渋谷区" (6 chars)
UTF-8: 18 bytes
UTF-16: 12 bytes (33% smaller!)
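Both size comparisons can be checked directly with `str.encode`. This is a small sketch; `encoded_sizes` is a hypothetical helper name, and it deliberately excludes the BOM so that short samples are not skewed.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for `text`, excluding any BOM."""
    # "utf-16-le" is used instead of "utf-16" because the latter
    # prepends a 2-byte BOM, which would skew short samples.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ["Hello World", "東京都渋谷区"]:
    u8, u16 = encoded_sizes(sample)
    print(f"{sample!r}: {len(sample)} chars, UTF-8 {u8} B, UTF-16 {u16} B")
```

This reports 11 and 22 bytes for "Hello World" (11 characters), and 18 and 12 bytes for "東京都渋谷区" (6 characters).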
Break-Even Point
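For mixed-script text, the break-even depends on the ratio of ASCII to CJK characters. Relative to UTF-16, each ASCII character saves 1 byte in UTF-8 and each CJK character costs 1 extra byte, so the two encodings tie when the counts are equal. Roughly:
- More than 50% ASCII → UTF-8 wins
- Less than 50% ASCII → UTF-16 wins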
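The break-even can be worked out from the byte widths. If a fraction a of the characters is ASCII and the rest are 3-byte CJK characters, UTF-8 averages a + 3(1 − a) = 3 − 2a bytes per character while UTF-16 stays at 2, so the sizes match at a = 0.5. The sketch below checks this on synthetic text; `size_ratio` is a hypothetical helper, and the mix of "a" and "東" is only a stand-in for real content.

```python
def size_ratio(ascii_count: int, cjk_count: int) -> float:
    """UTF-8 size divided by UTF-16 size for a synthetic ASCII/CJK mix."""
    # "a" stands in for any ASCII character, "東" for any 3-byte CJK character.
    text = "a" * ascii_count + "東" * cjk_count
    return len(text.encode("utf-8")) / len(text.encode("utf-16-le"))


for ascii_pct in (0, 25, 50, 75, 100):
    ratio = size_ratio(ascii_pct, 100 - ascii_pct)
    print(f"{ascii_pct:3d}% ASCII: UTF-8/UTF-16 = {ratio:.2f}")
```

A ratio below 1.0 means UTF-8 is smaller. The ratio falls from 1.50 at 0% ASCII to 1.00 at 50% and 0.50 at 100%.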
Real-World Examples
| Content Type | Typical UTF-8 / UTF-16 Ratio |
|---|---|
| English prose | 0.5× (UTF-8 half the size) |
| JSON API response | 0.6× (mostly ASCII keys + values) |
| Japanese blog post | 1.2× (UTF-8 slightly larger) |
| Chinese document | 1.4× (UTF-8 40% larger) |
| Emoji-heavy chat | 1.0× (roughly equal) |
Why UTF-8 Dominates the Web
Despite UTF-16 being smaller for some content, UTF-8 dominates because:
- ASCII compatibility: No overhead for HTML tags, JSON syntax, URLs
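- No byte-order ambiguity: UTF-16 comes in little-endian and big-endian forms and needs a BOM (Byte Order Mark) or an explicit declaration to tell them apart; UTF-8 has a single byte sequence per character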
- Self-synchronizing: You can find character boundaries from any byte position
- Universal standard: The HTML standard and JSON (RFC 8259) both call for UTF-8, and most web tooling defaults to it
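Two of these points are easy to see in code. The sketch below assumes valid UTF-8 input and an in-range position; `char_start` is a hypothetical helper that backs up over continuation bytes (bit pattern `10xxxxxx`) to find where a character begins. The last lines show how the same character serializes differently in the two UTF-16 byte orders.

```python
def char_start(data: bytes, pos: int) -> int:
    """Back up from `pos` to the first byte of the UTF-8 character containing it."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "a東b".encode("utf-8")   # b'a\xe6\x9d\xb1b'
print(char_start(data, 2))      # byte 2 is inside "東", which starts at byte 1

# UTF-16 byte order: the same character serializes differently per endianness.
print("A".encode("utf-16-le"))  # b'A\x00'
print("A".encode("utf-16-be"))  # b'\x00A'
```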
Use Case
When optimizing storage or network payload size, or choosing a database encoding for a multilingual application, understanding the UTF-8 vs UTF-16 tradeoff helps you make informed decisions that can reduce data transfer and storage costs.