CJK Unified Ideographs — Chinese, Japanese, Korean Characters


Learn about CJK Unified Ideographs in Unicode — their code point ranges, 3-byte UTF-8 encoding, and how Chinese, Japanese, and Korean share the same character set.

CJK Characters

Detailed Explanation

CJK Unified Ideographs

The CJK Unified Ideographs block (U+4E00 to U+9FFF) is one of the largest in Unicode, with 20,992 code points for characters shared across the Chinese (Hanzi), Japanese (Kanji), and Korean (Hanja) writing systems. The extension blocks (Ext. A through Ext. I) push the total beyond 90,000 ideographs.

Main CJK Blocks

Block	Range	Count	Plane
CJK Unified Ideographs	U+4E00–U+9FFF	~20,992	BMP
CJK Extension A	U+3400–U+4DBF	~6,592	BMP
CJK Extension B	U+20000–U+2A6DF	~42,720	SIP (Plane 2)
CJK Extension C–I	Various	~27,000	SIP/TIP
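The counts for the first three rows follow directly from the ranges, since a block's size is its last code point minus its first, plus one. A minimal Python sketch of that arithmetic is below; the `CJK_BLOCKS` list and the helper names `block_size`, `plane`, and `find_block` are illustrative assumptions, not part of any library.

```python
# Ranges copied from the table above. The list and helper names are
# hypothetical, introduced only for this sketch.
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Extension A", 0x3400, 0x4DBF),
    ("CJK Extension B", 0x20000, 0x2A6DF),
]


def block_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def plane(code_point: int) -> int:
    """Plane index: 0 is the BMP, 2 is the SIP, 3 is the TIP."""
    return code_point >> 16


def find_block(ch: str):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


for name, first, last in CJK_BLOCKS:
    print(f"{name}: {block_size(first, last):,} code points, plane {plane(first)}")
# CJK Unified Ideographs: 20,992 code points, plane 0
# CJK Extension A: 6,592 code points, plane 0
# CJK Extension B: 42,720 code points, plane 2
```

Note that a block's size counts code points in the range, which can differ slightly from the number of assigned characters in blocks that are not completely filled.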

UTF-8 Encoding

Characters in the main CJK block (BMP) use 3 bytes in UTF-8. For example:

世 (world) = U+4E16 → UTF-8: E4 B8 96
字 (character) = U+5B57 → UTF-8: E5 AD 97
韓 (Korea) = U+97D3 → UTF-8: E9 9F 93

Extension B and later characters reside on supplementary planes and require 4 bytes in UTF-8 and a surrogate pair in UTF-16.
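The byte sequences above can be checked with Python's built-in codecs. This is a small sketch, assuming Python 3.8 or later for `bytes.hex` with a separator; the helper names `utf8_hex` and `utf16_units` are made up for the example, and U+20000 (the first code point of Extension B) is chosen here only as a convenient supplementary-plane sample.

```python
def utf8_hex(ch: str) -> str:
    """UTF-8 bytes of ch as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


def utf16_units(ch: str) -> list:
    """UTF-16 code units of ch as uppercase hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


# BMP ideographs: three UTF-8 bytes each.
for ch in "世字韓":
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
# U+4E16 -> E4 B8 96
# U+5B57 -> E5 AD 97
# U+97D3 -> E9 9F 93

# Supplementary-plane ideograph: four UTF-8 bytes, two UTF-16 code units.
ext_b = chr(0x20000)
print(utf8_hex(ext_b))     # F0 A0 80 80
print(utf16_units(ext_b))  # ['D840', 'DC00']
```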

Han Unification

Unicode's Han Unification principle assigns a single code point to characters that are considered the same across Chinese, Japanese, and Korean, even when their visual forms differ slightly by region. For example, the character for "bone" may look subtly different in a Chinese font versus a Japanese font, but both use U+9AA8.

Character Names

CJK ideographs are named using the pattern CJK UNIFIED IDEOGRAPH-XXXX, where XXXX is the code point in hexadecimal (four digits in the BMP, five on the supplementary planes). Unlike Latin letters, they have no individual descriptive names in the Unicode Standard. The Unicode Inspector follows this convention.
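Python's standard `unicodedata` module reports names in this same derived form, which makes the pattern easy to confirm. In the sketch below, `derived_cjk_name` is a hypothetical helper that rebuilds the name from the code point; it applies only to unified ideographs, not to other CJK-related characters such as compatibility ideographs or radicals.

```python
import unicodedata


def derived_cjk_name(ch: str) -> str:
    """Build the derived name for a CJK unified ideograph from its code point."""
    return f"CJK UNIFIED IDEOGRAPH-{ord(ch):04X}"


for ch in "世字韓":
    # Compare the standard library's answer with the derived pattern.
    print(unicodedata.name(ch), unicodedata.name(ch) == derived_cjk_name(ch))
# CJK UNIFIED IDEOGRAPH-4E16 True
# CJK UNIFIED IDEOGRAPH-5B57 True
# CJK UNIFIED IDEOGRAPH-97D3 True
```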

Practical Implications

When working with CJK text, remember that each BMP ideograph occupies 3 bytes in UTF-8 (4 bytes for Extension B and later), compared to 1 byte for an ASCII character. A 100-character Japanese sentence therefore uses approximately 300 bytes in UTF-8. Database column sizing, API payload limits, and text truncation logic must account for this difference.
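The gap between character count and byte count, and one way to truncate to a byte budget without splitting a character, can be sketched as follows. `truncate_utf8` is a hypothetical helper, not a standard function; it relies on the input being valid UTF-8 so that the only malformed sequence after slicing is the incomplete one at the end.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing at an arbitrary byte can cut a multi-byte character in half;
    # errors="ignore" drops the incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


sample = "世界" * 50  # 100 BMP ideographs
print(len(sample))                  # 100
print(len(sample.encode("utf-8")))  # 300

print(truncate_utf8("世界", 4))  # 世
```

This sketch works at the code point level only. Text that uses combining marks or variation selectors may need truncation at grapheme boundaries instead, which is outside what this example covers.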

Use Case

Use this when calculating storage requirements for multilingual databases, debugging character encoding issues in CJK text, or understanding why string length and byte length differ significantly for Chinese, Japanese, and Korean content.
