Surrogate Pairs: Characters Beyond the BMP
Learn how characters outside the Basic Multilingual Plane use surrogate pairs in UTF-16, causing JavaScript's .length to return 2 for a single character.
Detailed Explanation
Surrogate Pairs in UTF-16
The Basic Multilingual Plane (BMP) contains Unicode code points U+0000 through U+FFFF. Characters beyond this range (the supplementary planes, U+10000 through U+10FFFF) cannot fit in a single 16-bit code unit, so UTF-16 encodes each of them as a surrogate pair: a high surrogate (0xD800–0xDBFF) followed by a low surrogate (0xDC00–0xDFFF).
How Surrogate Pairs Work
For a code point like U+1F600 (😀 Grinning Face):
- Subtract 0x10000 to get a 20-bit offset: 0x1F600 - 0x10000 = 0xF600
- High surrogate (top 10 bits): 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
- Low surrogate (bottom 10 bits): 0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00
JavaScript stores this as two code units: \uD83D\uDE00
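The arithmetic above can be sketched as a pair of small functions. The names `toSurrogatePair` and `fromSurrogatePair` are hypothetical helpers introduced for illustration, not built-in APIs; in practice `String.fromCodePoint` and `codePointAt` already perform this conversion.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Expected a code point from U+10000 to U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: recombine a surrogate pair into its code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));              // d83d de00
console.log(String.fromCharCode(high, low) === "😀");          // true
console.log(fromSurrogatePair(high, low) === 0x1F600);         // true
```

The range check is there because the formula is only meaningful for supplementary code points; BMP characters are stored as a single code unit and never go through this split.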
Impact on JavaScript .length
"😀".length // 2 (surrogate pair)
[..."😀"].length // 1 (code points)
"😀".codePointAt(0) // 128512 (U+1F600)
Characters That Use Surrogate Pairs
| Category | Range | Examples |
|---|---|---|
| Emoji | U+1F300–U+1FAFF | 😀 🚀 🍕 |
| Math symbols | U+1D400–U+1D7FF | 𝐀 𝐁 𝐂 (bold math) |
| Musical symbols | U+1D100–U+1D1FF | 𝄞 (treble clef) |
| Historic scripts | U+10000–U+1007F | 𐀀 (Linear B) |
| CJK Extension B | U+20000–U+2A6DF | Rare CJK ideographs |
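To check whether a given string contains any character from ranges like these, one option is a regular expression with the `u` flag, which makes the pattern match whole code points rather than individual code units. This is a minimal sketch; `hasSupplementaryChars` and `listSupplementaryChars` are hypothetical helper names.

```javascript
// Hypothetical helper: true if the string contains any code point above U+FFFF.
function hasSupplementaryChars(str) {
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

// Hypothetical helper: list the supplementary characters found in the string.
function listSupplementaryChars(str) {
  // Spreading a string iterates by code point, so pairs stay intact.
  return [...str].filter((ch) => ch.codePointAt(0) > 0xFFFF);
}

console.log(hasSupplementaryChars("Pizza 🍕"));  // true
console.log(hasSupplementaryChars("日本語"));     // false (all BMP)
console.log(listSupplementaryChars("a𝄞b😀"));    // ["𝄞", "😀"]
```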
String Operations That Break
Common string operations can corrupt surrogate pairs:
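// WRONG: Splits the pair when str starts with a supplementary character
str.substring(0, 1) // Returns a lone high surrogate
str.charAt(0) // Returns only the high surrogate
// CORRECT: Use code-point-aware methods (both return the first code point)
[...str].slice(0, 1).join("")
str.slice(0, [...str][0].length) // Throws on an empty string, where [...str][0] is undefined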
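A reusable version of the code-point-aware approach might look like the sketch below. Both function names are hypothetical. The truncation helper counts code points, not user-perceived characters, so multi-code-point sequences such as flag emoji can still be split.

```javascript
// Hypothetical helper: keep at most maxCodePoints code points,
// never cutting a surrogate pair in half.
function truncateByCodePoints(str, maxCodePoints) {
  let out = "";
  let count = 0;
  for (const ch of str) { // for...of iterates by code point
    if (count >= maxCodePoints) break;
    out += ch;
    count++;
  }
  return out;
}

// Hypothetical helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    // A well-formed pair yields a code point above 0xFFFF, so any value
    // in the surrogate range here must be unpaired.
    if (cp >= 0xD800 && cp <= 0xDFFF) return true;
  }
  return false;
}

const input = "😀abc";
console.log(hasLoneSurrogate(input.substring(0, 1)));           // true
console.log(hasLoneSurrogate(truncateByCodePoints(input, 1)));  // false
console.log(truncateByCodePoints(input, 2));                    // "😀a"
```

Unlike indexing into `[...str]`, the loop handles an empty string without throwing and stops as soon as the limit is reached.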
Byte Sizes
| Encoding | Bytes per Surrogate-Pair Character |
|---|---|
| UTF-8 | 4 bytes |
| UTF-16 | 4 bytes (2 code units × 2 bytes) |
| UTF-32 | 4 bytes (always) |
All three encodings need 4 bytes for a supplementary character, although the byte values themselves differ. Sizes diverge only for BMP characters: UTF-8 uses 1 to 3 bytes, UTF-16 uses 2, and UTF-32 still uses 4.
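The sizes in the table can be checked with a short sketch. `byteSizes` is a hypothetical helper; it assumes a well-formed string and no byte order mark, and it relies on the standard `TextEncoder`, which always produces UTF-8.

```javascript
// Hypothetical helper: encoded size of a string in bytes, per encoding.
// Assumes the string is well formed (no lone surrogates) and no BOM is written.
function byteSizes(str) {
  const utf8 = new TextEncoder().encode(str).length; // actual UTF-8 bytes
  const utf16 = str.length * 2;                      // 2 bytes per code unit
  const utf32 = [...str].length * 4;                 // 4 bytes per code point
  return { utf8, utf16, utf32 };
}

console.log(byteSizes("😀")); // { utf8: 4, utf16: 4, utf32: 4 }
console.log(byteSizes("A"));  // { utf8: 1, utf16: 2, utf32: 4 }
console.log(byteSizes("€"));  // { utf8: 3, utf16: 2, utf32: 4 }
```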
Use Case
When building JavaScript applications that manipulate strings containing emoji or rare characters, understanding surrogate pairs is essential to avoid data corruption during substring operations, database storage, and API payload handling.