Language Detection Basics — Identifying Languages Programmatically

Overview of techniques for automatic language detection in text, including n-gram analysis, Unicode range detection, and browser APIs.

Detailed Explanation

Detecting Languages Programmatically

Language detection is the process of identifying which language a piece of text is written in. While the language code reference helps you use codes, detection helps you assign them.

Detection Techniques

1. Unicode Script Detection

The fastest method: identify the script (writing system) used in the text.

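function detectScript(text) {
  // Returns the first script whose Unicode block appears anywhere in the text.
  if (/[\u0600-\u06FF]/.test(text)) return "Arabic";
  if (/[\u0590-\u05FF]/.test(text)) return "Hebrew";
  // Check kana before Han ideographs: Japanese text usually mixes both,
  // and kana is what sets it apart from Chinese.
  if (/[\u3040-\u309F]/.test(text)) return "Hiragana (Japanese)";
  if (/[\u30A0-\u30FF]/.test(text)) return "Katakana (Japanese)";
  if (/[\u4E00-\u9FFF]/.test(text)) return "CJK";
  if (/[\uAC00-\uD7AF]/.test(text)) return "Hangul (Korean)";
  if (/[\u0400-\u04FF]/.test(text)) return "Cyrillic";
  if (/[\u0370-\u03FF]/.test(text)) return "Greek";
  if (/[\u0E00-\u0E7F]/.test(text)) return "Thai";
  // No listed block matched: Latin script, digits, punctuation, or an unlisted script.
  return "Latin or Unknown";
}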

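Script detection alone cannot distinguish languages that share a script (e.g., English vs. French), and Han characters alone do not separate Chinese from Japanese kanji. The function also returns on the first match, so mixed-script text is labeled by the order of the checks rather than by which script dominates.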
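For mixed-script input, counting characters per script is one way to avoid depending on check order. The following is a minimal sketch, not a complete detector: the names countScripts and dominantScript and the list of scripts are illustrative assumptions, and it assumes a runtime that supports Unicode property escapes (ES2018 or later).

```javascript
// Illustrative script list (an assumption, not exhaustive).
// Requires Unicode property escapes (\p{Script=...}) and the "u" flag.
const SCRIPT_PATTERNS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
  Arabic: /\p{Script=Arabic}/u,
  Hebrew: /\p{Script=Hebrew}/u,
  Han: /\p{Script=Han}/u,
  Hiragana: /\p{Script=Hiragana}/u,
  Katakana: /\p{Script=Katakana}/u,
  Hangul: /\p{Script=Hangul}/u,
  Thai: /\p{Script=Thai}/u,
};

// Count how many characters of the text belong to each listed script.
function countScripts(text) {
  const counts = {};
  for (const ch of text) { // for...of iterates by code point
    for (const [name, pattern] of Object.entries(SCRIPT_PATTERNS)) {
      if (pattern.test(ch)) {
        counts[name] = (counts[name] || 0) + 1;
        break; // each character is counted for one script only
      }
    }
  }
  return counts;
}

// Return the script with the highest character count, or "Unknown"
// when no character matches (digits, punctuation, unlisted scripts).
function dominantScript(text) {
  const entries = Object.entries(countScripts(text));
  if (entries.length === 0) return "Unknown";
  entries.sort((a, b) => b[1] - a[1]);
  return entries[0][0];
}
```

The per-script counts can also feed simple heuristics. For example, Han characters appearing together with Hiragana or Katakana suggest Japanese rather than Chinese, although that remains a guess rather than a guarantee.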

2. N-gram Frequency Analysis

A statistical approach that compares the character n-gram frequencies of the input against known language profiles:

Bigrams: "th", "he", "in" are frequent in English
Trigrams: "the", "and", "ing" are highly English-specific
Libraries like franc use this technique
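The idea can be sketched in a few lines: build a trigram count profile for each language from sample text, then compare the input's profile against each one. This toy version uses cosine similarity as the comparison, which is one possible choice and not necessarily what any particular library does. The function names and the one-sentence training samples are assumptions for illustration; real profiles are built from large corpora.

```javascript
// Build a map of character trigram -> count for a piece of text.
// Non-letters collapse to single spaces so word boundaries become part of the trigrams.
function trigramProfile(text) {
  const normalized =
    " " + text.toLowerCase().replace(/[^\p{L}]+/gu, " ").trim() + " ";
  const profile = new Map();
  // Slices by UTF-16 code unit, which is adequate for this sketch.
  for (let i = 0; i + 3 <= normalized.length; i++) {
    const gram = normalized.slice(i, i + 3);
    profile.set(gram, (profile.get(gram) || 0) + 1);
  }
  return profile;
}

// Cosine similarity between two trigram profiles (0 = no overlap, 1 = identical direction).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [gram, count] of a) {
    normA += count * count;
    if (b.has(gram)) dot += count * b.get(gram);
  }
  for (const count of b.values()) normB += count * count;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Hypothetical, deliberately tiny training samples.
const SAMPLES = {
  en: "the quick brown fox jumps over the lazy dog and the cat is sleeping in the house",
  fr: "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
};

const PROFILES = Object.fromEntries(
  Object.entries(SAMPLES).map(([lang, text]) => [lang, trigramProfile(text)])
);

// Return the best-matching language and its score; lang is null when nothing overlaps.
function guessLanguage(text) {
  const input = trigramProfile(text);
  let best = { lang: null, score: 0 };
  for (const [lang, profile] of Object.entries(PROFILES)) {
    const score = cosineSimilarity(input, profile);
    if (score > best.score) best = { lang, score };
  }
  return best;
}
```

With profiles this small, results are only meaningful for inputs that resemble the samples. The score is still useful as a rough confidence signal: a low best score suggests the input matches none of the known profiles.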

3. Stop Word Detection

Check for language-specific common words:

const stopWords = {
  en: ["the", "is", "at", "which", "on"],
  fr: ["le", "la", "les", "de", "et"],
  de: ["der", "die", "das", "und", "ist"],
  es: ["el", "la", "los", "de", "en"],
  ja: ["の", "は", "を", "が", "で"],
};
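A table like the one above can be turned into a detector by counting how many tokens of the input appear in each list. The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, the table is repeated so the block is self-contained, and the ja entry is left out because Japanese is not space-delimited and would need substring matching instead of tokenization.

```javascript
// Same lists as the table above, minus "ja" (this sketch relies on word tokenization).
const STOP_WORDS = {
  en: ["the", "is", "at", "which", "on"],
  fr: ["le", "la", "les", "de", "et"],
  de: ["der", "die", "das", "und", "ist"],
  es: ["el", "la", "los", "de", "en"],
};

// Count, per language, how many tokens of the text are stop words of that language.
function scoreByStopWords(text) {
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  const scores = {};
  for (const [lang, words] of Object.entries(STOP_WORDS)) {
    const wordSet = new Set(words);
    scores[lang] = tokens.filter((token) => wordSet.has(token)).length;
  }
  return scores;
}

// Return the top-scoring language code, or null when there are no hits
// or the top two languages tie (the evidence is ambiguous).
function detectByStopWords(text) {
  const ranked = Object.entries(scoreByStopWords(text)).sort(
    (a, b) => b[1] - a[1]
  );
  const [best, runnerUp] = ranked;
  if (best[1] === 0 || best[1] === runnerUp[1]) return null;
  return best[0];
}
```

Note that "la" and "de" appear in both the French and Spanish lists, so a phrase like "la casa de papel" scores the same for both and the sketch returns null. Longer lists and longer inputs reduce such ties.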

4. Machine Learning

Modern approaches use models trained on multilingual corpora, ranging from linear classifiers over n-gram features to small neural networks. Libraries:

fastText (Facebook/Meta) — identifies 176 languages
langdetect (Python port of the Java language-detection library formerly hosted on Google Code; a naive Bayes classifier over character n-grams)
cld3 (Google Compact Language Detector v3)

Challenges

Short text (tweets, search queries) gives detectors too little signal, so accuracy drops
Code-mixed text ("Spanglish", "Hinglish") confuses detectors
Similar languages (Norwegian/Danish/Swedish, Serbian/Croatian) are hard to distinguish
Romanized text (e.g., Japanese romaji) lacks script cues

Use Case

Language detection is used in search engines to route queries, in email clients to suggest translations, in content moderation to apply the right language filters, and in CMS platforms to auto-tag content. It helps assign the correct ISO 639 language code to user-generated content.
