Korean Hangul Unicode Normalization
Understand how Unicode normalization works with Korean Hangul syllables. Learn about Jamo decomposition, Hangul Syllable composition, and the algorithmic decomposition/composition process.
Detailed Explanation
Hangul and Unicode Normalization
Korean Hangul has a unique relationship with Unicode normalization because its composition and decomposition are defined algorithmically rather than through lookup tables.
Hangul Syllable Structure
A Hangul syllable consists of:
- Leading consonant (Choseong): e.g., ᄀ (HANGUL CHOSEONG KIYEOK, ㄱ)
- Vowel (Jungseong): e.g., ᅡ (HANGUL JUNGSEONG A, ㅏ)
- Optional trailing consonant (Jongseong): e.g., ᆨ (HANGUL JONGSEONG KIYEOK)
Precomposed Syllable Blocks
Unicode defines 11,172 precomposed Hangul syllable blocks (U+AC00 to U+D7A3). The syllable 가 (ᄀ + ᅡ) is U+AC00 (HANGUL SYLLABLE GA).
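The figure 11,172 is the product of the number of possible leading consonants, vowels, and trailing consonants (19 × 21 × 28, with "no trailing consonant" counted as one of the 28). A quick check, as a minimal sketch using only Python's standard library (the constant names are illustrative):

```python
import unicodedata

# Counts from the Unicode Hangul syllable algorithm; "no trailing
# consonant" is counted as one of the 28 trailing options.
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_BASE, S_LAST = 0xAC00, 0xD7A3

syllable_count = L_COUNT * V_COUNT * T_COUNT
block_size = S_LAST - S_BASE + 1

print(syllable_count, block_size)     # 11172 11172
print(unicodedata.name(chr(S_BASE)))  # HANGUL SYLLABLE GA
```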
NFC vs NFD for Hangul
| Form | Result for 가 | Code Points |
|---|---|---|
| NFC | 가 | U+AC00 (1 code point) |
| NFD | 가 | U+1100 + U+1161 (2 code points) |
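The table can be reproduced with Python's standard `unicodedata` module. This is only an illustration; the variable names are arbitrary, and the strings are written as escapes so that the source file's own normalization form cannot affect the result.

```python
import unicodedata

ga = "\uAC00"  # the precomposed syllable GA as a single code point

nfd = unicodedata.normalize("NFD", ga)   # decompose into conjoining Jamo
nfc = unicodedata.normalize("NFC", nfd)  # compose back into one syllable

print([f"U+{ord(c):04X}" for c in nfd])  # ['U+1100', 'U+1161']
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+AC00']

# A syllable with a trailing consonant decomposes into three Jamo.
gak = "\uAC01"  # GA plus trailing KIYEOK
print(len(unicodedata.normalize("NFD", gak)))  # 3
```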
Algorithmic Composition
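Unlike Latin characters, where composition is table-based, Hangul uses a mathematical formula. Here L, V, and T are the code points of the leading consonant, vowel, and trailing consonant. When there is no trailing consonant, the T term is 0 (TBase is one below the first trailing consonant, U+11A8):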
SBase = 0xAC00
LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7
LCount = 19, VCount = 21, TCount = 28
NCount = VCount * TCount = 588
syllableIndex = (L - LBase) * NCount + (V - VBase) * TCount + (T - TBase)
composedCodePoint = SBase + syllableIndex
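The formula above, and its inverse, can be sketched in Python as follows. The function names `compose_hangul` and `decompose_hangul` are hypothetical helpers written for this illustration. In practice `unicodedata.normalize` already performs both steps, so this sketch is mainly useful for seeing the arithmetic.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 syllables in total


def compose_hangul(lead: str, vowel: str, trail: str = "") -> str:
    """Compose conjoining Jamo into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    # Index 0 means "no trailing consonant"; real trailing Jamo start at 1.
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("lead or vowel is not a modern conjoining Jamo")
    if trail and not (1 <= t_index < T_COUNT):
        raise ValueError("trail is not a modern trailing Jamo")
    return chr(S_BASE + l_index * N_COUNT + v_index * T_COUNT + t_index)


def decompose_hangul(syllable: str) -> str:
    """Split one precomposed syllable into two or three conjoining Jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


print(hex(ord(compose_hangul("\u1100", "\u1161"))))            # 0xac00
print(hex(ord(compose_hangul("\u1112", "\u1161", "\u11AB"))))  # 0xd55c
```

Raising `ValueError` for out-of-range input is a design choice of this sketch; a real normalizer would instead leave non-composable sequences unchanged.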
Practical Impact
Most Korean text is already in NFC form (precomposed syllable blocks). However, text from certain input methods or text processing systems may use decomposed Jamo. The two forms render identically but are different code point sequences, so without normalization, string comparison and search of Korean text can fail silently.
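A short illustration of the silent failure, and of normalizing both sides before comparing. The helper name `same_text` is hypothetical, and NFC is chosen here only because it matches the form most Korean text already uses.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\uD55C\uAE00"  # two precomposed syllables (the word "Hangul")
decomposed = unicodedata.normalize("NFD", composed)  # same text as six Jamo

print(len(composed), len(decomposed))    # 2 6
print(composed == decomposed)            # False: raw comparison fails
print("\uAE00" in decomposed)            # False: substring search fails too
print(same_text(composed, decomposed))   # True
```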
NFKC and Hangul Compatibility Jamo
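Unicode includes "compatibility Jamo" (the modern letters occupy U+3131–U+3163) that are separate from the conjoining Jamo (U+1100–U+11FF) used in decomposition. NFC and NFD leave compatibility Jamo unchanged. NFKC and NFKD map them to their conjoining counterparts, and NFKC then composes any resulting leading consonant + vowel sequence into a syllable block.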
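The difference between the canonical and compatibility forms can be seen with two compatibility letters typed one after the other. A minimal sketch with the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# KIYEOK + A as standalone compatibility letters (U+3131, U+314F).
compat = "\u3131\u314F"

nfc = unicodedata.normalize("NFC", compat)
nfkd = unicodedata.normalize("NFKD", compat)
nfkc = unicodedata.normalize("NFKC", compat)

print(nfc == compat)                       # True: NFC leaves them untouched
print([f"U+{ord(c):04X}" for c in nfkd])   # ['U+1100', 'U+1161']
print([f"U+{ord(c):04X}" for c in nfkc])   # ['U+AC00']
```

Note that NFKC can therefore merge letters the user entered separately into a single syllable, which may or may not be desirable depending on the application.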
Use Case
Essential for developers building Korean-language applications, search engines indexing Korean content, and any system processing Korean text from mixed sources (web forms, OCR, file systems). Korean filenames on macOS are commonly stored or returned in decomposed form (HFS+ uses a variant of NFD), so they should be normalized before comparison with standard NFC text.
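For the filename case, one possible approach is to normalize both the wanted name and each listed name before comparing. The helper `find_name` below is a hypothetical sketch that works on any iterable of names, such as the result of `os.listdir`; the sample listing is made up for illustration.

```python
import unicodedata
from typing import Iterable, Optional


def find_name(names: Iterable[str], wanted: str) -> Optional[str]:
    """Return the first name equal to `wanted` after NFC normalization."""
    target = unicodedata.normalize("NFC", wanted)
    for name in names:
        if unicodedata.normalize("NFC", name) == target:
            return name  # returned exactly as it appears in the listing
    return None


wanted = "\uD55C\uAE00.txt"  # NFC name, e.g. from a web form
# Simulated directory listing where the Korean name is decomposed.
listing = [unicodedata.normalize("NFD", wanted), "notes.txt"]

print(wanted in listing)                        # False
print(find_name(listing, wanted) is not None)   # True
```

Returning the name as stored, rather than the normalized form, keeps it usable for opening the file on systems that treat the two forms as different names.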