CJK Unified Ideographs — Chinese, Japanese, Korean Characters
Learn about CJK Unified Ideographs in Unicode — their code point ranges, 3-byte UTF-8 encoding, and how Chinese, Japanese, and Korean share the same character set.
Detailed Explanation
CJK Unified Ideographs
The CJK Unified Ideographs block (U+4E00 to U+9FFF) is one of the largest in Unicode, with 20,992 code points shared across the Chinese (Hanzi), Japanese (Kanji), and Korean (Hanja) writing systems. The extension blocks (Ext. A through Ext. I) push the total past 97,000 ideographs as of Unicode 15.1.
Main CJK Blocks
| Block | Range | Count | Plane |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00–U+9FFF | ~20,992 | BMP |
| CJK Extension A | U+3400–U+4DBF | ~6,592 | BMP |
| CJK Extension B | U+20000–U+2A6DF | ~42,720 | SIP (Plane 2) |
| CJK Extension C–I | Various | ~27,000 | SIP/TIP |
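The ranges in the table can be turned into a simple lookup. The following is a minimal sketch covering only the first three rows; the names `CJK_BLOCKS`, `cjk_block`, and `block_size` are illustrative and not part of any library, and Extensions C through I are deliberately left out.

```python
from typing import Optional

# (name, first code point, last code point) for the first three table rows.
# Extensions C-I are omitted in this sketch.
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Extension A", 0x3400, 0x4DBF),
    ("CJK Extension B", 0x20000, 0x2A6DF),
]


def cjk_block(ch: str) -> Optional[str]:
    """Return the block name for a single character, or None if it is not listed."""
    cp = ord(ch)
    for name, start, end in CJK_BLOCKS:
        if start <= cp <= end:
            return name
    return None


def block_size(name: str) -> int:
    """Return the number of code points in a listed block."""
    for block_name, start, end in CJK_BLOCKS:
        if block_name == name:
            return end - start + 1
    raise KeyError(name)


if __name__ == "__main__":
    print(cjk_block("世"))           # main block
    print(cjk_block("\U00020000"))   # first code point of Extension B
    print(cjk_block("A"))            # not a CJK ideograph
```

A linear scan is enough for three ranges; a longer table would probably be better served by a sorted list and `bisect`.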
UTF-8 Encoding
Characters in the main CJK block (BMP) use 3 bytes in UTF-8. For example:
世 (world) = U+4E16 → UTF-8: E4 B8 96
字 (character) = U+5B57 → UTF-8: E5 AD 97
韓 (Korean) = U+97D3 → UTF-8: E9 9F 93
Extension B and later characters reside on supplementary planes and require 4 bytes in UTF-8 and a surrogate pair in UTF-16.
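These byte counts can be checked with Python's built-in codecs. The sketch below uses two hypothetical helpers, `utf8_hex` and `utf16_units`, and takes U+20000 (the first code point of Extension B) as the supplementary-plane example.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def utf16_units(ch: str) -> list:
    """Return the UTF-16 code units of a character as uppercase hex strings."""
    raw = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    for ch in ("世", "字", "韓", "\U00020000"):
        print(f"U+{ord(ch):04X}  UTF-8: {utf8_hex(ch)}  UTF-16: {' '.join(utf16_units(ch))}")
```

The three BMP characters produce three UTF-8 bytes and one UTF-16 code unit each, while U+20000 produces four UTF-8 bytes (`F0 A0 80 80`) and the surrogate pair `D840 DC00`.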
Han Unification
Unicode's Han Unification principle assigns a single code point to characters that are considered the same across Chinese, Japanese, and Korean, even when their visual forms differ slightly by region. For example, the character for "bone" may look subtly different in a Chinese font versus a Japanese font, but both use U+9AA8.
Character Names
CJK ideographs are named using the pattern CJK UNIFIED IDEOGRAPH-XXXX, where XXXX is the hexadecimal code point (four digits in the BMP, five on the supplementary planes). Unlike Latin characters, they do not have individual descriptive names in the Unicode standard. The Unicode Inspector follows this convention.
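The naming pattern is simple enough to reproduce by hand and compare against Python's `unicodedata` module. In this sketch, `cjk_name` is a hypothetical helper that assumes its argument really is a CJK unified ideograph; it does no range checking.

```python
import unicodedata


def cjk_name(ch: str) -> str:
    """Build the algorithmic name; assumes ch is a CJK unified ideograph."""
    return f"CJK UNIFIED IDEOGRAPH-{ord(ch):04X}"


if __name__ == "__main__":
    for ch in ("世", "\U00020000"):
        # Both lines should agree when ch is a CJK unified ideograph.
        print(cjk_name(ch))
        print(unicodedata.name(ch))
```

For 世 both calls give `CJK UNIFIED IDEOGRAPH-4E16`. Passing a non-CJK character to `cjk_name` would still return a string in this format, so real code would want a range check first.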
Practical Implications
When working with CJK text, remember that each ideograph in the BMP occupies 3 bytes in UTF-8 (4 bytes for Extension B and later), compared to 1 byte for ASCII. A 100-character Japanese sentence uses approximately 300 bytes in UTF-8, since kana also take 3 bytes each. Database column sizing, API payload limits, and text truncation logic must account for this difference.
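The gap between character count and byte count matters most when truncating text to a byte limit. The sketch below shows one possible approach: slice the encoded bytes, then decode with `errors="ignore"` so that a partial trailing sequence is dropped instead of producing invalid UTF-8. The function name `truncate_utf8` is an illustrative assumption.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input is valid UTF-8, so only the cut-off tail can be incomplete;
    # errors="ignore" silently drops those leftover bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    sentence = "日本語のテキスト"
    print(len(sentence))                  # number of characters
    print(len(sentence.encode("utf-8")))  # number of UTF-8 bytes
    print(truncate_utf8(sentence, 10))    # fits only whole 3-byte characters
```

Here the 8-character string occupies 24 bytes, and a 10-byte limit keeps the first three characters (9 bytes). This works at the code point level; it does not take grapheme clusters or combining sequences into account.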
Use Case
Use this when calculating storage requirements for multilingual databases, debugging character encoding issues in CJK text, or understanding why string length and byte length differ significantly for Chinese, Japanese, and Korean content.