Unicode Escape Sequences Across Programming Languages
Comprehensive guide to Unicode escape sequences, including the \uXXXX, \u{XXXXX}, \UXXXXXXXX, and \N{name} formats, across JavaScript, Python, Java, Go, Rust, and other languages. Covers surrogate pairs and emoji encoding.
Detailed Explanation
Unicode Escape Sequences
Unicode escape sequences allow embedding any Unicode character in source code using its code point number. Different languages use different syntax, but the concept is the same: represent a character by its numeric identity.
Common Formats
\uXXXX → BMP code point (exactly 4 hex digits)
Used by: JavaScript, Java, C#, JSON, Python, Go, C/C++
\u{XXXXX} → any code point (1-6 hex digits)
Used by: JavaScript (ES6+), Rust, Swift
\UXXXXXXXX → any code point (exactly 8 hex digits)
Used by: Python, Go, C/C++
\xHH → a byte, or a code point up to U+00FF, depending on the language and string type (2 hex digits)
Used by: JavaScript, Python, C, PHP
\N{NAME} → character by Unicode name
Used by: Python, Perl, C++23
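As a rough illustration of how these formats relate, the following Python sketch spells out one character in each numeric form. The helper name escape_forms and its dictionary keys are made up for this example and are not part of any library.

```python
def escape_forms(ch: str) -> dict[str, str]:
    """Hypothetical helper: return the escape spellings of one character."""
    cp = ord(ch)  # numeric code point of the character
    forms = {
        "braced": f"\\u{{{cp:X}}}",  # \u{...} form, variable number of digits
        "long": f"\\U{cp:08X}",      # \UXXXXXXXX form, always 8 digits
    }
    if cp <= 0xFFFF:
        forms["short"] = f"\\u{cp:04X}"  # \uXXXX only reaches the BMP
    if cp <= 0xFF:
        forms["hex"] = f"\\x{cp:02X}"    # \xHH only reaches U+00FF
    return forms


print(escape_forms("\u00e9"))      # e with acute: all four forms exist
print(escape_forms("\U0001F600"))  # grinning face: no short or hex form
```

The supplementary character has no single \uXXXX spelling, which is why formats limited to that form fall back to surrogate pairs.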
Basic Multilingual Plane (BMP)
Characters in the BMP (U+0000 to U+FFFF) include most common scripts and can be represented with a single \uXXXX escape:
"\u0041" // A
"\u00E9" // é (e with acute)
"\u4E16" // 世 (CJK character for "world")
"\u03B1" // α (Greek alpha)
Supplementary Characters and Surrogate Pairs
Characters above U+FFFF (most emoji, historic scripts, mathematical alphanumeric symbols) do not fit in a single \uXXXX escape. Languages and formats limited to that form must write them as a pair of surrogate escapes:
// Emoji: U+1F600 (Grinning Face)
// ES6+ code point escape:
"\u{1F600}"
// Pre-ES6 surrogate pair:
"\uD83D\uDE00"
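The surrogate pair is computed with the formula defined by UTF-16: subtract 0x10000 from the code point, add the upper 10 bits of the result to 0xD800 to get the high surrogate, and add the lower 10 bits to 0xDC00 to get the low surrogate. For U+1F600 this yields D83D and DE00.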
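The UTF-16 calculation can be sketched in a few lines of Python. The function names to_surrogate_pair and from_surrogate_pair are chosen for this example only.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF have surrogate pairs")
    offset = code_point - 0x10000    # 20-bit value
    high = 0xD800 + (offset >> 10)   # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"\\u{high:04X}\\u{low:04X}")  # prints \uD83D\uDE00
```

Running it on U+1F600 reproduces the pre-ES6 pair shown above.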
Language-Specific Examples
Python:
"\u0041" # A (BMP)
"\U0001F600" # 😀 (supplementary)
"\N{SNOWMAN}" # ☃
Go:
"\u0041" // A (BMP)
"\U0001F600" // 😀 (supplementary)
Rust:
"\u{41}" // A
"\u{1F600}" // 😀
// Rust has no fixed-width \uXXXX; \xHH is limited to 00-7F in string literals
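The names accepted by Python's \N{NAME} escape come from the Unicode Character Database, which the standard unicodedata module also exposes at runtime. A small sketch follows; the describe helper is a made-up name for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Hypothetical helper: format a character as 'U+XXXX NAME'."""
    # The second argument is returned for characters without a name,
    # such as most control characters.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# lookup() resolves a name to a character, like \N{...} does in a literal.
assert unicodedata.lookup("SNOWMAN") == "\N{SNOWMAN}" == "\u2603"

print(describe("\u2603"))      # U+2603 SNOWMAN
print(describe("\U0001F600"))  # U+1F600 GRINNING FACE
```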
When to Use Unicode Escapes
- Non-printable characters: Control characters, zero-width spaces, directional marks.
- Source encoding safety: An ASCII-only escape survives even if the file is saved or read with the wrong encoding.
- Preventing homoglyph attacks: Escaping lookalike characters makes them visible during source review.
- JSON compatibility: JSON's only Unicode escape is \uXXXX, so an escaped supplementary character must be written as a surrogate pair.
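Two of the cases above can be tried out with Python's standard library. This is a minimal sketch: the visible helper is a hypothetical name, and the unicode_escape codec it relies on produces Python-style escapes, so its output is not guaranteed to be valid in other languages.

```python
import json


def visible(text: str) -> str:
    """Hypothetical helper: spell out every non-ASCII character as an escape.

    The codec also escapes backslashes and ASCII control characters.
    """
    return text.encode("unicode_escape").decode("ascii")


# A zero-width space and a Cyrillic lookalike become obvious once escaped.
print(visible("a\u200bb"))      # a\u200bb
print(visible("p\u0430ypal"))   # p\u0430ypal

# json.dumps escapes non-ASCII by default (ensure_ascii=True), so a
# supplementary character comes out as a surrogate pair.
encoded = json.dumps("\U0001F600")
print(encoded)                  # "\ud83d\ude00"

# Decoding the pair restores the single code point.
assert json.loads(encoded) == "\U0001F600"
```

Passing ensure_ascii=False to json.dumps keeps the character unescaped instead, which is also valid JSON as long as the output is encoded as UTF-8.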
Use Case
Unicode escapes are used when handling internationalization, processing emoji in APIs and databases, working with mathematical or scientific notation in source code, creating language-learning applications, building text processing tools, ensuring source file portability across systems with different default encodings, and debugging encoding issues in data pipelines.