Language Detection Basics — Identifying Languages Programmatically

Overview of techniques for automatic language detection in text, including n-gram analysis, Unicode range detection, and browser APIs.

Detailed Explanation

Detecting Languages Programmatically

Language detection is the process of identifying which language a piece of text is written in. While the language code reference helps you use codes, detection helps you assign them.

Detection Techniques

1. Unicode Script Detection

The fastest method: identify the script (writing system) used in the text.

function detectScript(text) {
  // The first matching range wins, so check order matters.
  if (/[\u0600-\u06FF]/.test(text)) return "Arabic";
  if (/[\u0590-\u05FF]/.test(text)) return "Hebrew";
  // Kana is checked before Han ideographs: Japanese text usually mixes
  // kanji with kana, and would otherwise be reported as generic "CJK".
  if (/[\u3040-\u309F]/.test(text)) return "Hiragana (Japanese)";
  if (/[\u30A0-\u30FF]/.test(text)) return "Katakana (Japanese)";
  // Han ideographs are shared by Chinese, Japanese, and Korean.
  if (/[\u4E00-\u9FFF]/.test(text)) return "CJK";
  if (/[\uAC00-\uD7AF]/.test(text)) return "Hangul (Korean)";
  if (/[\u0400-\u04FF]/.test(text)) return "Cyrillic";
  if (/[\u0370-\u03FF]/.test(text)) return "Greek";
  if (/[\u0E00-\u0E7F]/.test(text)) return "Thai";
  return "Latin or Unknown";
}

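This approach cannot distinguish between languages that share a script (e.g., English vs. French, or Chinese vs. Japanese text written only in kanji). It also reports the first range that matches, so mixed-script text is labeled by check order rather than by its dominant script.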
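One way to handle mixed-script text is to count characters per script and report the most frequent one. The sketch below is an illustration rather than a reference implementation: it relies on Unicode property escapes (\p{Script=...}, available in modern JavaScript engines with the u flag), and the names SCRIPT_PATTERNS, countScripts, and dominantScript are hypothetical helpers introduced here.

```javascript
// Hypothetical helper table: one regex per script, using Unicode property escapes.
const SCRIPT_PATTERNS = {
  Arabic: /\p{Script=Arabic}/u,
  Hebrew: /\p{Script=Hebrew}/u,
  Hiragana: /\p{Script=Hiragana}/u,
  Katakana: /\p{Script=Katakana}/u,
  Han: /\p{Script=Han}/u,
  Hangul: /\p{Script=Hangul}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
  Thai: /\p{Script=Thai}/u,
  Latin: /\p{Script=Latin}/u,
};

// Count how many characters of the text belong to each known script.
function countScripts(text) {
  const counts = {};
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    for (const [name, pattern] of Object.entries(SCRIPT_PATTERNS)) {
      if (pattern.test(ch)) {
        counts[name] = (counts[name] || 0) + 1;
        break;
      }
    }
  }
  return counts;
}

// Return the script with the most characters, or "Unknown" if none matched
// (digits, punctuation, and whitespace belong to no script listed above).
function dominantScript(text) {
  const entries = Object.entries(countScripts(text));
  if (entries.length === 0) return "Unknown";
  entries.sort((a, b) => b[1] - a[1]);
  return entries[0][0];
}

console.log(dominantScript("Hello мир"));   // Latin (5 Latin letters vs 3 Cyrillic)
console.log(countScripts("日本語のテキスト")); // { Han: 3, Hiragana: 1, Katakana: 4 }
```

The per-script counts are often more useful than the single winner: any Hiragana or Katakana in the counts suggests Japanese even when Han characters dominate.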

2. N-gram Frequency Analysis

A statistical approach that compares the character n-gram frequencies of the input against precomputed profiles for each known language:

  • Bigrams: "th", "he", "in" are frequent in English
  • Trigrams: "the", "and", "ing" are strongly characteristic of English
  • Libraries like franc use this technique
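The comparison step can be sketched as follows. This is a toy illustration under stated assumptions, not how any particular library works: the profiles are built from a single hand-written sentence per language (real detectors train on large corpora), similarity is measured with cosine similarity over trigram frequencies, and every name here (SAMPLES, trigrams, buildProfile, similarity, detectByTrigrams) is hypothetical.

```javascript
// Assumption: tiny hand-written samples stand in for real training corpora.
const SAMPLES = {
  en: "the quick brown fox jumps over the lazy dog and then the cat is sleeping in the house",
  fr: "le renard brun saute par-dessus le chien paresseux et puis le chat dort dans la maison",
  de: "der schnelle braune fuchs springt über den faulen hund und dann schläft die katze im haus",
};

// Split text into overlapping character trigrams; non-letters collapse to spaces.
// Slicing is by UTF-16 unit, which is adequate for the scripts used here.
function trigrams(text) {
  const clean = " " + text.toLowerCase().replace(/[^\p{L}]+/gu, " ").trim() + " ";
  const grams = [];
  for (let i = 0; i + 3 <= clean.length; i++) grams.push(clean.slice(i, i + 3));
  return grams;
}

// Map each trigram to its relative frequency in the text.
function buildProfile(text) {
  const grams = trigrams(text);
  const profile = new Map();
  for (const g of grams) profile.set(g, (profile.get(g) || 0) + 1 / grams.length);
  return profile;
}

// Cosine similarity between two frequency profiles (0 means no shared trigrams).
function similarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [gram, weight] of a) {
    normA += weight * weight;
    if (b.has(gram)) dot += weight * b.get(gram);
  }
  for (const weight of b.values()) normB += weight * weight;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

const PROFILES = Object.fromEntries(
  Object.entries(SAMPLES).map(([lang, sample]) => [lang, buildProfile(sample)])
);

// Return the best-matching language, or "und" (undetermined) if nothing overlaps.
function detectByTrigrams(text) {
  const input = buildProfile(text);
  let best = { lang: "und", score: 0 };
  for (const [lang, profile] of Object.entries(PROFILES)) {
    const score = similarity(input, profile);
    if (score > best.score) best = { lang, score };
  }
  return best;
}

console.log(detectByTrigrams("the dog is in the house").lang); // en
console.log(detectByTrigrams("le chat dort").lang);            // fr
```

With profiles this small the scores are only meaningful for text that resembles the samples; the structure (build profiles once, score each input against all of them) is the part that carries over.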

3. Stop Word Detection

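Count how many of each language's most common words (stop words) appear in the text. Short lists overlap (for example, "la" and "de" occur in both French and Spanish), so a single match is weak evidence: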

const stopWords = {
  en: ["the", "is", "at", "which", "on"],
  fr: ["le", "la", "les", "de", "et"],
  de: ["der", "die", "das", "und", "ist"],
  es: ["el", "la", "los", "de", "en"],
  ja: ["の", "は", "を", "が", "で"],
};
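A minimal sketch of how such lists might be used for scoring is shown below. It assumes whitespace-delimited languages only, so the Japanese entry is left out (Japanese has no spaces between words and would need substring or morphological matching instead). STOP_WORDS, scoreByStopWords, and detectByStopWords are hypothetical names, and "und" is the ISO 639 code for an undetermined language.

```javascript
// Same idea as the table above, restricted to space-delimited languages.
const STOP_WORDS = {
  en: ["the", "is", "at", "which", "on"],
  fr: ["le", "la", "les", "de", "et"],
  de: ["der", "die", "das", "und", "ist"],
  es: ["el", "la", "los", "de", "en"],
};

// Count stop-word hits per language for the given text.
function scoreByStopWords(text) {
  // Split on anything that is not a letter, then drop empty tokens.
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  const scores = {};
  for (const [lang, words] of Object.entries(STOP_WORDS)) {
    const wordSet = new Set(words);
    scores[lang] = tokens.filter((token) => wordSet.has(token)).length;
  }
  return scores;
}

// Pick the language with the most hits; return "und" on no hits or a tie.
function detectByStopWords(text) {
  const ranked = Object.entries(scoreByStopWords(text)).sort((a, b) => b[1] - a[1]);
  const [best, second] = ranked;
  if (best[1] === 0 || best[1] === second[1]) return "und";
  return best[0];
}

console.log(detectByStopWords("The book is on the table")); // en
console.log(detectByStopWords("Der Hund und die Katze"));   // de
console.log(detectByStopWords("la casa de papel"));         // und ("la" and "de" match fr and es equally)
```

Returning "und" on a tie is a deliberate choice in this sketch: with lists this short, guessing between French and Spanish would be wrong about half the time.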

4. Machine Learning

Modern approaches use statistical or neural models trained on large multilingual corpora. Commonly used libraries:

  • fastText (Facebook/Meta) — identifies 176 languages
  • langdetect (Python port of the language-detection Java library; naive Bayes over character n-grams)
  • cld3 (Google Compact Language Detector v3)

Challenges

  • Short text (tweets, search queries) gives detectors too little evidence, so accuracy drops
  • Code-mixed text ("Spanglish", "Hinglish") confuses detectors
  • Similar languages (Norwegian/Danish/Swedish, Serbian/Croatian) are hard to distinguish
  • Romanized text (e.g., Japanese romaji) lacks script cues
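A common mitigation for the first and third challenges is to refuse to guess when the evidence is thin. The sketch below wraps an arbitrary detector with two guards. Everything in it is an assumption made for illustration: detectWithGuards is a hypothetical helper, the detector argument is assumed to return an array of { lang, score } objects with scores between 0 and 1, and the thresholds are placeholders that would need tuning on real data.

```javascript
// Wrap a detector so that short or ambiguous input yields "und" (undetermined).
// `detector` is a hypothetical function: text -> [{ lang, score }, ...].
function detectWithGuards(text, detector, { minLetters = 20, minMargin = 0.1 } = {}) {
  // Guard 1: too few letters to trust any statistical signal.
  const letters = (text.match(/\p{L}/gu) || []).length;
  if (letters < minLetters) return "und";

  // Sort a copy so the detector's own array is not mutated.
  const ranked = detector(text).slice().sort((a, b) => b.score - a.score);
  if (ranked.length === 0) return "und";

  // Guard 2: the top two candidates are too close (e.g. Norwegian vs Danish).
  if (ranked.length > 1 && ranked[0].score - ranked[1].score < minMargin) return "und";

  return ranked[0].lang;
}

// Stub detectors standing in for a real library.
const closeCall = () => [{ lang: "no", score: 0.48 }, { lang: "da", score: 0.45 }];
const clearCall = () => [{ lang: "en", score: 0.9 }, { lang: "fr", score: 0.05 }];

console.log(detectWithGuards("Jeg har en hund og en katt hjemme", closeCall));    // und
console.log(detectWithGuards("This sentence is clearly long enough", clearCall)); // en
console.log(detectWithGuards("ok", clearCall));                                   // und
```

Callers can then fall back to other signals, such as the user's locale or the page language, whenever the result is "und".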

Use Case

Language detection is used in search engines to route queries, in email clients to suggest translations, in content moderation to apply the right language filters, and in CMS platforms to auto-tag content. It helps assign the correct ISO 639 language code to user-generated content.
