Text Segmentation with Intl.Segmenter (Word, Sentence, Grapheme)
Break text into words, sentences, or grapheme clusters for any locale using Intl.Segmenter. Learn about CJK word boundaries, emoji segmentation, and locale-specific text splitting.
Detailed Explanation
Intl.Segmenter: Locale-Aware Text Segmentation
Intl.Segmenter splits text into meaningful segments (words, sentences, or grapheme clusters) according to locale-specific rules. This is particularly important for languages that do not use spaces between words.
Word Segmentation
// English: spaces separate words
const en = new Intl.Segmenter('en', { granularity: 'word' });
const words = [...en.segment('Hello world!')].filter(s => s.isWordLike);
// Segments: "Hello", "world" (each item also carries index, input, and isWordLike)
// Japanese: no spaces between words
const ja = new Intl.Segmenter('ja', { granularity: 'word' });
const jaWords = [...ja.segment('東京は日本の首都です')].filter(s => s.isWordLike);
// Typically "東京", "は", "日本", "の", "首都", "です": particles are word-like too,
// and the exact split depends on the engine's dictionary data
Why Spaces Are Not Enough
Many languages do not use spaces between words:
- Japanese: 私は学生です (watashi wa gakusei desu)
- Chinese: 我是学生 (wo shi xuesheng)
- Thai: ฉันเป็นนักเรียน (chan pen nak rian)
- Khmer, Lao, Burmese (Myanmar): spaces separate phrases, not individual words
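To make the contrast concrete, here is a minimal sketch that compares whitespace splitting with Intl.Segmenter on the Thai sentence from the list above. The helper name wordsOf is hypothetical, and the exact number of words the segmenter finds depends on the dictionary data shipped with the engine.

```javascript
// Hypothetical helper: returns the word-like segments of `text` for `locale`.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
}

const thai = 'ฉันเป็นนักเรียน';

// Splitting on whitespace finds no boundaries: the whole sentence is one "word".
const bySpaces = thai.split(/\s+/);

// The segmenter uses locale-aware dictionary rules instead of spaces.
const bySegmenter = wordsOf(thai, 'th');

console.log(bySpaces.length); // 1
console.log(bySegmenter);     // several words; the exact split is engine-dependent
```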
Sentence Segmentation
const seg = new Intl.Segmenter('en', { granularity: 'sentence' });
const text = 'Hello world. How are you? I am fine!';
const sentences = [...seg.segment(text)].map(s => s.segment);
// ["Hello world. ", "How are you? ", "I am fine!"]
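Each segment object also exposes an index property, the UTF-16 offset where the segment starts in the input. The sketch below, with illustrative variable names, records that offset and trims the trailing whitespace that sentence segments keep.

```javascript
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const paragraph = 'Hello world. How are you? I am fine!';

// Map every sentence to its start offset and its text without trailing spaces.
const located = Array.from(sentenceSegmenter.segment(paragraph), (s) => ({
  start: s.index,
  text: s.segment.trimEnd(),
}));

console.log(located);
// [
//   { start: 0,  text: 'Hello world.' },
//   { start: 13, text: 'How are you?' },
//   { start: 26, text: 'I am fine!' }
// ]
```

Keeping the offsets makes it possible to map a sentence back to its position in the original string, for example to highlight it.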
Grapheme Cluster Segmentation
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
// An emoji ZWJ sequence counts as one grapheme
const emoji = '👨\u200D👩\u200D👧\u200D👦'; // family emoji: 4 emoji joined by 3 zero-width joiners
[...seg.segment(emoji)].length; // 1 (not 7 code points)
// Accented characters
const accented = 'e\u0301'; // e + combining acute accent, renders as é
[...seg.segment(accented)].length; // 1 (not 2!)
// Compare with string length
emoji.length; // 11 (UTF-16 code units)
accented.length; // 2 (UTF-16 code units)
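The gap between grapheme count and string length matters most when cutting text. As a minimal sketch (the function name truncateGraphemes is an assumption, not part of the API), truncation can walk grapheme segments so that a cluster is never split in the middle:

```javascript
// Hypothetical helper: keep at most `maxGraphemes` user-perceived characters.
function truncateGraphemes(text, maxGraphemes, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxGraphemes) break;
    out += segment; // append the whole cluster, never a fragment of it
    count += 1;
  }
  return out;
}

const family = '👨\u200D👩\u200D👧\u200D👦'; // one grapheme, 11 UTF-16 code units
const input = `${family}abc`;

console.log(truncateGraphemes(input, 2)); // the full family emoji followed by "a"
console.log(input.slice(0, 2));           // only the first emoji of the family survives
```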
Practical Uses
// Character counter that handles emoji correctly
function countGraphemes(text, locale = 'en') {
const seg = new Intl.Segmenter(locale, { granularity: 'grapheme' });
return [...seg.segment(text)].length;
}
countGraphemes('Hello 😀'); // 7 (not 8)
countGraphemes('👨\u200D👩\u200D👧\u200D👦😊'); // 2 (family ZWJ sequence + smiley)
// Word count for CJK text
function countWords(text, locale) {
const seg = new Intl.Segmenter(locale, { granularity: 'word' });
return [...seg.segment(text)].filter(s => s.isWordLike).length;
}
countWords('東京は日本の首都です', 'ja'); // typically 6: particles such as は and の count as words (engine-dependent)
Browser Support
Intl.Segmenter is supported in Chrome 87+, Edge 87+, Safari 14.1+, and Firefox 125+. It arrived later than most other Intl APIs, so check for it before use if you need to support older browsers.
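For environments that may lack the API, a feature check with a rough fallback is one option. The sketch below is an assumption about how such a guard could look, not a complete polyfill: the fallback counts code points, which overcounts combining sequences and ZWJ emoji.

```javascript
// Hypothetical helper: grapheme count with a degraded fallback.
function countGraphemesSafe(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    let count = 0;
    for (const _ of segmenter.segment(text)) count += 1;
    return count;
  }
  // Fallback: iterate by code point. Correct for simple emoji,
  // but too high for combining marks and ZWJ sequences.
  return Array.from(text).length;
}

console.log(countGraphemesSafe('Hello 😀')); // 7
```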
Use Case
Text segmentation is essential for search engines processing CJK text, word counters that need accurate counts for Japanese and Chinese, text editors implementing word-boundary navigation, spell checkers for non-space-separated languages, and any application that needs to correctly count characters including emoji. A Twitter-like character counter that uses string.length will count a family emoji as 11 characters instead of 1. A search engine indexing Japanese text needs word boundaries to create an inverted index.
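For the word-boundary navigation case, the object returned by segment() has a containing() method that returns the segment covering a given UTF-16 offset. A minimal sketch, assuming a hypothetical helper named wordRangeAt that an editor could call on double-click:

```javascript
// Hypothetical helper: find the segment boundaries around a caret offset.
function wordRangeAt(text, offset, locale = 'en') {
  const segments = new Intl.Segmenter(locale, { granularity: 'word' }).segment(text);
  const hit = segments.containing(offset);
  if (!hit) return null; // offset is outside the string
  return {
    start: hit.index,
    end: hit.index + hit.segment.length,
    isWordLike: hit.isWordLike, // false for spaces and punctuation
  };
}

console.log(wordRangeAt('Hello world!', 8));  // { start: 6, end: 11, isWordLike: true }
console.log(wordRangeAt('Hello world!', 5));  // { start: 5, end: 6, isWordLike: false }
console.log(wordRangeAt('Hello world!', 99)); // null
```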