Japanese Kana and Unicode Normalization
Learn how Unicode normalization affects Japanese Hiragana and Katakana characters, particularly dakuten/handakuten combining marks and halfwidth/fullwidth Katakana conversions.
Detailed Explanation
Japanese Kana and Normalization
Japanese text involves several Unicode normalization scenarios, primarily around voiced/semi-voiced marks (dakuten/handakuten) and halfwidth/fullwidth forms.
Dakuten and Handakuten
Kana carrying a voiced (゙ dakuten, combining form U+3099) or semi-voiced (゚ handakuten, combining form U+309A) mark can be encoded in either of two ways:
- Precomposed: が (ga = U+304C, single code point)
- Decomposed: か + ゙ (ka + combining dakuten, two code points)
| Character | NFC | NFD |
|---|---|---|
| が (ga) | U+304C | U+304B + U+3099 |
| だ (da) | U+3060 | U+305F + U+3099 |
| ぱ (pa) | U+3071 | U+306F + U+309A |
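The rows above can be checked with the standard `String.prototype.normalize` method. This is a minimal sketch; the helper name `codePoints` is hypothetical and exists only to print code points in a readable form.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const precomposed = "\u304C";      // が as a single code point
const decomposed = "\u304B\u3099"; // か followed by combining dakuten

console.log(codePoints(precomposed.normalize("NFD"))); // [ 'U+304B', 'U+3099' ]
console.log(codePoints(decomposed.normalize("NFC")));  // [ 'U+304C' ]

// The two spellings differ as raw strings but agree after normalization.
console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize("NFC")); // true
```

Writing the strings with `\u` escapes keeps the example independent of how an editor or copy-paste step happens to store the characters.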
Halfwidth vs Fullwidth Katakana
Japanese text may contain halfwidth Katakana (U+FF65–U+FF9F), common in legacy systems and some file formats:
| Halfwidth | Fullwidth | NFKC Result |
|---|---|---|
| ｶ (ka, U+FF76) | カ (U+30AB) | カ (U+30AB) |
| ｶﾞ (ga, U+FF76 + U+FF9E) | ガ (U+30AC) | ガ (U+30AC) |
NFC/NFD preserve halfwidth forms. Only NFKC/NFKD convert halfwidth to fullwidth.
Practical Example
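// Halfwidth katakana "ga": U+FF76 (halfwidth ka) + U+FF9E (halfwidth dakuten)
const hw = "\uFF76\uFF9E"; // "ｶﾞ", written with escapes so the halfwidth form survives copy-paste
hw.normalize("NFC"); // "\uFF76\uFF9E" (unchanged, still halfwidth)
hw.normalize("NFKC"); // "\u30AC" (ガ, fullwidth ga, precomposed)
hw.normalize("NFKD"); // "\u30AB\u3099" (fullwidth ka + combining dakuten)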
Why This Matters
Japanese text from different sources (web, legacy databases, CSV exports, OCR) often mixes fullwidth and halfwidth forms. Without NFKC normalization, text comparison and search can miss matches between halfwidth ｶﾞ (U+FF76 + U+FF9E) and fullwidth ガ (U+30AC), even though they represent the same character.
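One way to avoid such misses is to normalize both sides with NFKC before comparing or searching. The sketch below assumes that NFKC folding is acceptable for the data in question; NFKC also folds other compatibility characters, such as fullwidth Latin letters and digits, which may or may not be wanted. The function names `kanaEquals` and `kanaIncludes` are hypothetical.

```javascript
// Hypothetical helpers: compare or search strings after NFKC normalization.
function kanaEquals(a, b) {
  return a.normalize("NFKC") === b.normalize("NFKC");
}

function kanaIncludes(haystack, needle) {
  return haystack.normalize("NFKC").includes(needle.normalize("NFKC"));
}

const halfwidth = "\uFF76\uFF9E\uFF72\uFF84\uFF9E"; // halfwidth "gaido" (5 code points)
const fullwidth = "\u30AC\u30A4\u30C9";             // fullwidth "gaido": ガイド

console.log(halfwidth === fullwidth);                    // false
console.log(kanaEquals(halfwidth, fullwidth));           // true
console.log(kanaIncludes("東京" + halfwidth, fullwidth)); // true
```

For large datasets it is usually cheaper to normalize once at ingestion time and store the normalized form, rather than normalizing on every comparison.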
CJK Compatibility Ideographs
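Unicode also includes the CJK Compatibility Ideographs block (U+F900–U+FAFF, with characters assigned up to U+FAD9). Nearly all of these have canonical singleton decompositions, so all four normalization forms, including NFC and NFD, replace them with their standard unified equivalents. They are rare in modern text but appear in some legacy databases.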
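A small check of how one compatibility ideograph behaves under each form, using U+F900, whose standard equivalent is U+8C48 (豈). Because this mapping is a canonical one, NFC and NFD apply it as well as NFKC and NFKD. This is an illustrative sketch and the variable names are hypothetical.

```javascript
const compat = "\uF900";  // CJK COMPATIBILITY IDEOGRAPH-F900
const unified = "\u8C48"; // the corresponding standard unified ideograph

// Every normalization form replaces the compatibility ideograph.
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  const normalized = compat.normalize(form);
  console.log(form, normalized === unified); // true for each form
}
```

Since the replacement is not reversible, a system that must preserve the original compatibility code points should keep the raw text alongside the normalized copy.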
Use Case
Kana normalization is critical for Japanese text processing: search engines serving Japanese users, e-commerce platforms handling Japanese product names, and any system that ingests text from legacy sources that may contain halfwidth Katakana. It is also important for normalizing OCR output.