CJK Character Length: Chinese, Japanese, Korean Text
Learn how Chinese, Japanese, and Korean characters affect string length in UTF-8 (3 bytes each), UTF-16 (2 bytes each), and UTF-32 (4 bytes each).
Detailed Explanation
CJK Characters: 3 Bytes in UTF-8
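Most Chinese, Japanese, and Korean characters sit in the Basic Multilingual Plane (BMP) at or above U+0800: CJK Unified Ideographs (U+4E00–U+9FFF), Hiragana (U+3040–U+309F), Katakana (U+30A0–U+30FF), Hangul Syllables (U+AC00–U+D7AF), and related blocks. Every code point from U+0800 through U+FFFF takes 3 bytes in UTF-8, so these characters require 3 bytes each. Rarer ideographs in the supplementary planes (for example CJK Extension B, which starts at U+20000) take 4 bytes.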
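The 3-byte figure follows from the UTF-8 range boundaries rather than from anything specific to CJK. As a rough sketch (the function name `utf8BytesForCodePoint` and the sample characters are illustrative choices, not part of any library), the byte count for a single code point can be derived like this:

```javascript
// Number of bytes UTF-8 uses for a single Unicode code point.
function utf8BytesForCodePoint(cp) {
  if (cp <= 0x7f) return 1;    // ASCII
  if (cp <= 0x7ff) return 2;   // e.g. Latin supplements, Greek, Cyrillic
  if (cp <= 0xffff) return 3;  // rest of the BMP, including the common CJK blocks
  return 4;                    // supplementary planes
}

// One sample each: ASCII, Kanji, Hiragana, Katakana, Hangul,
// and U+20000 (a CJK Extension B ideograph outside the BMP).
const samples = ["A", "漢", "あ", "カ", "한", "\u{20000}"];

for (const ch of samples) {
  const cp = ch.codePointAt(0);
  const label = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  console.log(label, utf8BytesForCodePoint(cp));
}
```

Only the last sample should report 4 bytes; the Kanji, Kana, and Hangul samples all fall in the 3-byte range.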
Example String
東京都渋谷区 (Tokyo Shibuya-ku)
Length Measurements
| Metric | Value |
|---|---|
| JavaScript .length | 6 |
| Code points | 6 |
| Grapheme clusters | 6 |
| UTF-8 bytes | 18 |
| UTF-16 bytes | 12 |
| UTF-32 bytes | 24 |
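These metrics can be reproduced in JavaScript. The sketch below assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (current browsers and recent Node.js versions); the helper name `measure` is made up for this example.

```javascript
// Compute the length metrics from the table for any string.
function measure(text) {
  const segmenter = new Intl.Segmenter("ja", { granularity: "grapheme" });
  const utf16Units = text.length;       // what JavaScript .length reports
  const codePoints = [...text].length;  // string iteration walks code points
  return {
    utf16Units,
    codePoints,
    graphemes: [...segmenter.segment(text)].length,
    utf8Bytes: new TextEncoder().encode(text).length,
    utf16Bytes: utf16Units * 2,         // excludes any BOM
    utf32Bytes: codePoints * 4,
  };
}

console.log(measure("東京都渋谷区"));
```

For BMP-only text like this example, `.length`, the code point count, and the grapheme count all agree. They diverge once supplementary-plane characters or combining sequences appear.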
UTF-8 vs UTF-16 for CJK
CJK text is one of the few cases where UTF-16 is more compact than UTF-8. Each BMP CJK character costs 3 bytes in UTF-8 but only 2 bytes in UTF-16, so for text that is predominantly CJK, UTF-16 is about 33% smaller than UTF-8.
However, UTF-8 is still preferred for web content because:
- Mixed content (CJK + ASCII) is common, and ASCII characters are 1 byte in UTF-8 vs 2 in UTF-16
- UTF-8 is the standard encoding for HTML, JSON, and HTTP
- UTF-8 has no byte-order issues (no BOM needed)
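The effect of mixed content is easy to check. The following is a small sketch, with the sample strings and the helper name `encodedSizes` chosen purely for illustration, comparing the two encodings for pure CJK text and for the same text wrapped in ASCII markup:

```javascript
// Byte sizes of a string in UTF-8 and UTF-16 (no BOM counted).
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length,
    utf16: text.length * 2,
  };
}

const pureCjk = "こんにちは世界";
const markup = '<p class="greeting">こんにちは世界</p>';

console.log(encodedSizes(pureCjk)); // { utf8: 21, utf16: 14 }
console.log(encodedSizes(markup));  // { utf8: 45, utf16: 62 }
```

Once the ASCII markup outweighs the CJK payload, the advantage flips back to UTF-8.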
Japanese Mixed Text
Japanese text typically mixes Kanji, Hiragana, Katakana, and ASCII:
こんにちは世界！Hello!
Here, each Hiragana (こんにちは) and Kanji (世界) character is 3 bytes in UTF-8, the full-width exclamation mark (！) is also 3 bytes, and "Hello!" is 6 bytes. The 14-character string therefore takes 30 bytes in UTF-8 (8 × 3 + 6), more than twice its character count.
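To see where the bytes go, the characters can be grouped by their encoded width. This is a minimal sketch; `byteBreakdown` is a hypothetical helper, and the full-width exclamation mark is written as the escape `\uFF01` so it cannot be confused with the ASCII one.

```javascript
const encoder = new TextEncoder();

// Group the characters of a string by how many UTF-8 bytes each one needs.
function byteBreakdown(text) {
  const byWidth = {};
  let totalChars = 0;
  let totalBytes = 0;
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const width = encoder.encode(ch).length;
    byWidth[width] = (byWidth[width] || 0) + 1;
    totalChars += 1;
    totalBytes += width;
  }
  return { byWidth, totalChars, totalBytes };
}

const mixed = "こんにちは世界\uFF01Hello!";
console.log(byteBreakdown(mixed));
// byWidth: { 1: 6, 3: 8 }, totalChars: 14, totalBytes: 30
```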
Database Considerations
MySQL's utf8 character set (an alias for utf8mb3, now deprecated) stores at most 3 bytes per character. That covers the BMP, and with it the common CJK characters, but not emoji or supplementary-plane CJK ideographs, which need 4 bytes and therefore require utf8mb4. Always use utf8mb4 for modern applications.
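When data may still pass through a 3-byte utf8 column, an application-side check can flag strings that would not fit. The function below is only an illustrative pre-check with a made-up name, `fitsInUtf8mb3`, and is not part of any MySQL driver. It treats any code point above U+FFFF as needing 4 bytes.

```javascript
// True if every character lies in the BMP and so fits in 3 UTF-8 bytes.
function fitsInUtf8mb3(text) {
  for (const ch of text) { // iterates by code point
    if (ch.codePointAt(0) > 0xffff) return false;
  }
  return true;
}

console.log(fitsInUtf8mb3("東京都渋谷区"));     // true: all BMP ideographs
console.log(fitsInUtf8mb3("\u{20BB7}野家"));    // false: U+20BB7 is in CJK Extension B
console.log(fitsInUtf8mb3("寿司\u{1F363}"));    // false: U+1F363 is an emoji
```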
Use Case
When building applications for Asian markets or handling multilingual content, knowing that most CJK characters use 3 UTF-8 bytes each (and a few use 4) is essential for accurate storage planning, API payload size estimation, and database column sizing.
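One practical consequence is that a limit expressed in bytes holds far fewer CJK characters than ASCII ones. As a sketch under the assumption that a field has a fixed UTF-8 byte budget (the helper `truncateToUtf8Bytes` and the limits used are hypothetical), text can be cut at a character boundary like this:

```javascript
const encoder = new TextEncoder();

// Keep as many whole code points as fit within maxBytes of UTF-8.
function truncateToUtf8Bytes(text, maxBytes) {
  let used = 0;
  let out = "";
  for (const ch of text) { // iterates by code point, so no character is split
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

console.log(truncateToUtf8Bytes("東京都渋谷区", 10)); // "東京都" (9 bytes)
console.log(truncateToUtf8Bytes("Hello", 3));         // "Hel"
```

This version cuts between code points, which is sufficient for plain CJK text. Strings with combining marks or emoji sequences would need grapheme-aware splitting instead.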