Japanese Kana and Unicode Normalization

Learn how Unicode normalization affects Japanese Hiragana and Katakana characters, particularly dakuten/handakuten combining marks and halfwidth/fullwidth Katakana conversions.

Detailed Explanation

Japanese Kana and Normalization

Japanese text involves several Unicode normalization scenarios, primarily around voiced/semi-voiced marks (dakuten/handakuten) and halfwidth/fullwidth forms.

Dakuten and Handakuten

Kana that carry a voiced mark (dakuten) or semi-voiced mark (handakuten) can be encoded in either of two ways:

  • Precomposed: が (ga = U+304C, single code point)
  • Decomposed: か + ゙ (ka + combining dakuten, two code points)

Character   NFC (precomposed)   NFD (decomposed)
が (ga)     U+304C              U+304B + U+3099
だ (da)     U+3060              U+305F + U+3099
ぱ (pa)     U+3071              U+306F + U+309A
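
The two encodings can be compared directly in code. The following is a minimal sketch, assuming a JavaScript runtime with String.prototype.normalize (ES2015 or later); the helper name codePoints is hypothetical and exists only for illustration.

```javascript
// Two encodings of the same kana "ga".
const precomposed = "\u304C";       // が as a single code point
const decomposed = "\u304B\u3099";  // か followed by the combining dakuten

// Without normalization the strings differ, even though they render alike.
const equalRaw = precomposed === decomposed;

// NFC composes and NFD decomposes; either one makes the two comparable.
const equalNfc = precomposed === decomposed.normalize("NFC");
const equalNfd = precomposed.normalize("NFD") === decomposed;

// Hypothetical helper: list the code points of a string as "U+XXXX".
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(equalRaw, equalNfc, equalNfd);            // false true true
console.log(codePoints("\u3071".normalize("NFD")));   // [ 'U+306F', 'U+309A' ]
```

Whichever form is chosen, the important point is to apply the same one to both sides of a comparison.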

Halfwidth vs Fullwidth Katakana

Japanese text may contain halfwidth Katakana (U+FF65–U+FF9F), common in legacy systems and some file formats:

Halfwidth              Fullwidth (NFKC result)
ｶ (U+FF76)             カ (ka, U+30AB)
ｶﾞ (U+FF76 U+FF9E)     ガ (ga, U+30AC)

NFC/NFD preserve halfwidth forms. Only NFKC/NFKD convert halfwidth to fullwidth.

Practical Example

// Halfwidth katakana "ga": halfwidth ka (U+FF76) + halfwidth dakuten (U+FF9E)
const hw = "\uFF76\uFF9E";  // "ｶﾞ"

hw.normalize("NFC");   // "\uFF76\uFF9E" (unchanged, still halfwidth)
hw.normalize("NFKC");  // "\u30AC" (ガ: fullwidth ga, precomposed)
hw.normalize("NFKD");  // "\u30AB\u3099" (fullwidth ka + combining dakuten)

Why This Matters

Japanese text from different sources (web, legacy databases, CSV exports, OCR) often mixes fullwidth and halfwidth forms. Without NFKC normalization, text comparison and search can miss matches between ガ (U+30AC) and ｶﾞ (U+FF76 U+FF9E), even though they represent the same syllable.
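
One way to avoid such misses is to normalize both sides before comparing or searching. The sketch below assumes NFKC is acceptable for the data at hand (it also folds other compatibility characters, such as fullwidth Latin letters); the function names kanaKey and sameKana are hypothetical.

```javascript
// Hypothetical helper: build a comparison key by applying NFKC.
function kanaKey(text) {
  return text.normalize("NFKC");
}

// Hypothetical helper: compare two strings by their NFKC keys.
function sameKana(a, b) {
  return kanaKey(a) === kanaKey(b);
}

const fullwidth = "\u30AC";        // ガ (precomposed fullwidth ga)
const halfwidth = "\uFF76\uFF9E";  // halfwidth ka + halfwidth dakuten

console.log(fullwidth === halfwidth);        // false
console.log(sameKana(fullwidth, halfwidth)); // true

// Search: normalize both the stored text and the query.
const record = "\uFF76\uFF9E\uFF7D";  // halfwidth ga + su
const query = "\u30AC\u30B9";         // fullwidth ガス
const found = kanaKey(record).includes(kanaKey(query));
console.log(found);                   // true
```

In practice the key would usually be computed once at ingestion time and stored alongside the original text, so the original form is not lost.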

CJK Compatibility Ideographs

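Unicode also includes CJK Compatibility Ideographs (U+F900–U+FAD9). Despite the name, most of these have canonical decompositions to their standard (unified) equivalents, so all four normalization forms, not only NFKC/NFKD, replace them. These are rare in modern text but appear in some legacy databases.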
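
As a small illustration, U+F900 normalizes to the unified ideograph U+8C48. The sketch below, again assuming a runtime with String.prototype.normalize, shows that NFC performs this replacement as well as NFKC, so round-tripping such text through any normalization form loses the compatibility code point.

```javascript
const compat = "\uF900";   // CJK COMPATIBILITY IDEOGRAPH-F900
const unified = "\u8C48";  // 豈, the corresponding unified ideograph

const rawEqual = compat === unified;                     // false
const nfcEqual = compat.normalize("NFC") === unified;    // true
const nfkcEqual = compat.normalize("NFKC") === unified;  // true

console.log(rawEqual, nfcEqual, nfkcEqual);
```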

Use Case

Kana normalization is critical for Japanese text processing: search engines serving Japanese users, e-commerce platforms handling Japanese product names, and any system ingesting text from legacy sources that may contain halfwidth Katakana. It is also important for normalizing OCR output.
