Language Detection Basics — Identifying Languages Programmatically
Overview of techniques for automatic language detection in text, including n-gram analysis, Unicode range detection, and browser APIs.
Detailed Explanation
Detecting Languages Programmatically
Language detection is the process of identifying which language a piece of text is written in. While the language code reference helps you use codes, detection helps you assign them.
Detection Techniques
1. Unicode Script Detection
The fastest method: identify the script (writing system) used in the text.
function detectScript(text) {
  // Returns the first script found in the order tested below,
  // not necessarily the most frequent script in the text.
  if (/[\u0600-\u06FF]/.test(text)) return "Arabic";
  if (/[\u0590-\u05FF]/.test(text)) return "Hebrew";
  // Test kana before CJK ideographs: Japanese text usually mixes kanji
  // with kana, so testing CJK first would never reach these branches.
  if (/[\u3040-\u309F]/.test(text)) return "Hiragana (Japanese)";
  if (/[\u30A0-\u30FF]/.test(text)) return "Katakana (Japanese)";
  if (/[\u4E00-\u9FFF]/.test(text)) return "CJK";
  if (/[\uAC00-\uD7AF]/.test(text)) return "Hangul (Korean)";
  if (/[\u0400-\u04FF]/.test(text)) return "Cyrillic";
  if (/[\u0370-\u03FF]/.test(text)) return "Greek";
  if (/[\u0E00-\u0E7F]/.test(text)) return "Thai";
  return "Latin or Unknown";
}
Script detection alone cannot distinguish languages that share a script (e.g., English vs. French, or Chinese vs. Japanese text written only in kanji).
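A first-match check like the one above can be misled by a single stray character. One possible refinement is to count characters per script and report the most frequent one. The sketch below assumes a runtime that supports Unicode property escapes (`\p{Script=...}` with the `u` flag); `SCRIPT_PATTERNS`, `countScripts`, and `dominantScript` are illustrative names, not part of any library.

```javascript
// Hypothetical helpers: count characters per script, then pick the most frequent.
const SCRIPT_PATTERNS = {
  Arabic: /\p{Script=Arabic}/u,
  Hebrew: /\p{Script=Hebrew}/u,
  Han: /\p{Script=Han}/u,
  Hiragana: /\p{Script=Hiragana}/u,
  Katakana: /\p{Script=Katakana}/u,
  Hangul: /\p{Script=Hangul}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
  Thai: /\p{Script=Thai}/u,
  Latin: /\p{Script=Latin}/u,
};

function countScripts(text) {
  const counts = {};
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of text) {
    for (const [name, pattern] of Object.entries(SCRIPT_PATTERNS)) {
      if (pattern.test(ch)) {
        counts[name] = (counts[name] || 0) + 1;
        break; // each character is counted for one script only
      }
    }
  }
  return counts;
}

function dominantScript(text) {
  const entries = Object.entries(countScripts(text));
  if (entries.length === 0) return "Unknown"; // digits, punctuation, or empty input
  entries.sort((a, b) => b[1] - a[1]);
  return entries[0][0];
}

console.log(dominantScript("Hello, мир!"));   // Latin (5 Latin letters vs 3 Cyrillic)
console.log(countScripts("日本語のテキスト")); // { Han: 3, Hiragana: 1, Katakana: 4 }
```

The per-script counts are often more useful than the single winner: any Hiragana or Katakana alongside Han characters is a strong hint that the text is Japanese rather than Chinese.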
2. N-gram Frequency Analysis
Statistical approach comparing character n-gram frequencies against known language profiles:
- Bigrams: "th", "he", "in" are frequent in English
- Trigrams: "the", "and", "ing" are very common in English
- Libraries like franc use this technique
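To make the idea concrete, here is a minimal sketch of trigram profiling. It is not how franc or any other library is implemented: the profiles come from one toy sentence per language (real profiles are built from large corpora), cosine similarity is only one of several possible comparison measures, and `SAMPLES`, `trigrams`, `cosine`, and `detectByTrigrams` are hypothetical names.

```javascript
// Count character trigrams in lowercased text; runs of non-letters become one space.
function trigrams(text) {
  const clean = " " + text.toLowerCase().replace(/[^\p{L}]+/gu, " ").trim() + " ";
  const counts = new Map();
  for (let i = 0; i + 3 <= clean.length; i++) {
    const gram = clean.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two trigram count maps (0 = no overlap, 1 = identical).
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [gram, n] of a) {
    normA += n * n;
    if (b.has(gram)) dot += n * b.get(gram);
  }
  for (const n of b.values()) normB += n * n;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Toy training data (assumption): far too small for real use.
const SAMPLES = {
  en: "the quick brown fox jumps over the lazy dog and then the dog is sleeping in the sun",
  fr: "le renard brun saute par dessus le chien paresseux et le chien dort dans le jardin",
  de: "der schnelle braune fuchs springt über den faulen hund und der hund schläft in der sonne",
};
const PROFILES = Object.fromEntries(
  Object.entries(SAMPLES).map(([lang, sample]) => [lang, trigrams(sample)])
);

// Return the best-matching profile, or "und" (undetermined) if nothing overlaps.
function detectByTrigrams(text) {
  const input = trigrams(text);
  let best = { lang: "und", score: 0 };
  for (const [lang, profile] of Object.entries(PROFILES)) {
    const score = cosine(input, profile);
    if (score > best.score) best = { lang, score };
  }
  return best;
}

console.log(detectByTrigrams("the dog is in the sun").lang);        // en
console.log(detectByTrigrams("le chien dort dans le jardin").lang); // fr
```

Padding the text with a leading and trailing space lets word boundaries contribute trigrams such as " th" and "he ", which tend to be more distinctive than trigrams from the middle of words.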
3. Stop Word Detection
Check for language-specific common words:
const stopWords = {
  en: ["the", "is", "at", "which", "on"],
  fr: ["le", "la", "les", "de", "et"],
  de: ["der", "die", "das", "und", "ist"],
  es: ["el", "la", "los", "de", "en"],
  ja: ["の", "は", "を", "が", "で"],
};
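One way to put such a table to work is to count how many tokens of the input appear in each language's list. The sketch below repeats the table so that it runs on its own. `UNSEGMENTED`, `detectByStopWords`, and `topLanguage` are illustrative names, and treating Japanese by scanning individual characters is a simplifying assumption, since Japanese text is not separated by spaces.

```javascript
const stopWords = {
  en: ["the", "is", "at", "which", "on"],
  fr: ["le", "la", "les", "de", "et"],
  de: ["der", "die", "das", "und", "ist"],
  es: ["el", "la", "los", "de", "en"],
  ja: ["の", "は", "を", "が", "で"],
};

// Assumption: languages written without spaces are matched character by character.
const UNSEGMENTED = new Set(["ja"]);

// Return a score per language: the number of stop-word hits in the text.
function detectByStopWords(text) {
  const lower = text.toLowerCase();
  const tokens = lower.split(/[^\p{L}]+/u).filter(Boolean); // split on non-letters
  const chars = [...lower];
  const scores = {};
  for (const [lang, words] of Object.entries(stopWords)) {
    const set = new Set(words);
    const units = UNSEGMENTED.has(lang) ? chars : tokens;
    scores[lang] = units.filter((unit) => set.has(unit)).length;
  }
  return scores;
}

// Pick the highest-scoring language, or "und" (undetermined) when nothing matched.
function topLanguage(scores) {
  const [lang, score] = Object.entries(scores).sort((a, b) => b[1] - a[1])[0];
  return score > 0 ? lang : "und";
}

console.log(detectByStopWords("The cat is on the mat"));
// { en: 4, fr: 0, de: 0, es: 0, ja: 0 }
console.log(detectByStopWords("la casa de los niños"));
// { en: 0, fr: 2, de: 0, es: 3, ja: 0 }
```

The second call shows the main weakness of the approach: "la" and "de" appear in both the French and Spanish lists, so closely related languages can score nearly the same on short input.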
4. Machine Learning
Modern approaches use statistical or neural models trained on multilingual corpora. Libraries:
- fastText (Facebook/Meta) — identifies 176 languages
- langdetect (Python port of Nakatani Shuyo's language-detection library, originally hosted on Google Code)
- cld3 (Google Compact Language Detector v3)
Challenges
- Short text (tweets, search queries) gives detectors too little signal, so accuracy drops
- Code-mixed text ("Spanglish", "Hinglish") confuses detectors
- Similar languages (Norwegian/Danish/Swedish, Serbian/Croatian) are hard to distinguish
- Romanized text (e.g., Japanese romaji) lacks script cues
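A common way to live with these limits is to decline to answer when the evidence is weak rather than return a confident wrong guess. The sketch below is one possible guard, not a standard algorithm: it takes per-language scores from any detector and returns "und" (the code for an undetermined language) when the text is too short or the top two candidates are too close. The function name `pickWithFallback` and the thresholds `minLength` and `minMargin` are assumptions chosen for illustration and would need tuning on real data.

```javascript
// Hypothetical guard around any detector that yields { lang: score } pairs.
function pickWithFallback(text, scores, { minLength = 10, minMargin = 0.2 } = {}) {
  // Very short input rarely carries enough signal.
  if (text.trim().length < minLength) return "und";

  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0 || ranked[0][1] <= 0) return "und";

  const best = ranked[0][1];
  const second = ranked.length > 1 ? ranked[1][1] : 0;
  // Relative gap between the top two candidates; small gaps suggest
  // similar languages or code-mixed text.
  const margin = (best - second) / best;
  return margin < minMargin ? "und" : ranked[0][0];
}

console.log(pickWithFallback("hi", { en: 0.9, fr: 0.1 }));                     // und (too short)
console.log(pickWithFallback("a longer sentence here", { no: 0.51, da: 0.49 })); // und (too close)
console.log(pickWithFallback("a longer sentence here", { en: 0.9, fr: 0.1 }));   // en
```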
Use Case
Language detection is used in search engines to route queries, in email clients to suggest translations, in content moderation to apply the right language filters, and in CMS platforms to auto-tag content. It helps assign the correct ISO 639 language code to user-generated content.