Reversing Unicode and Emoji Correctly

Understand the challenges of reversing strings containing Unicode characters, emoji, combining marks, and surrogate pairs. Learn safe reversal techniques across languages.


Detailed Explanation

The Unicode Reversal Problem

Reversing strings containing Unicode characters is more complex than it appears. Naive reversal algorithms can produce corrupted output because they operate on code units rather than visual characters (grapheme clusters).

The Problem with Surrogate Pairs

In UTF-16 (used internally by JavaScript and Java), characters outside the Basic Multilingual Plane (BMP) — like most emoji — are represented as two 16-bit code units called a surrogate pair.

"😀".length            // 2 (two code units!)
"😀".split("")         // ["\uD83D", "\uDE00"] (broken apart)
"😀".split("").reverse().join("") // "\uDE00\uD83D" (corrupted!)
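To make the two views of the same string concrete, the following minimal sketch lists the UTF-16 code units and the code points of a single emoji. The variable and helper names (face, isHighSurrogate, isLowSurrogate) are illustrative and not part of any API.

```javascript
// Inspect how one emoji is stored in a UTF-16 string.
const face = "\u{1F600}"; // 😀 (U+1F600, outside the BMP)

// Code units: the two 16-bit halves of the surrogate pair.
const codeUnits = [];
for (let i = 0; i < face.length; i++) {
  codeUnits.push(face.charCodeAt(i).toString(16));
}

// Code points: string iteration joins the pair back into one value.
const codePoints = Array.from(face, ch => ch.codePointAt(0).toString(16));

// High surrogates occupy 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
const isHighSurrogate = unit => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = unit => unit >= 0xdc00 && unit <= 0xdfff;

console.log(codeUnits.join(" "));  // d83d de00
console.log(codePoints.join(" ")); // 1f600
console.log(isHighSurrogate(face.charCodeAt(0)), isLowSurrogate(face.charCodeAt(1))); // true true
```

A reversal that swaps the two code units puts the low surrogate first, which is not a valid encoding of any character.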

Combining Characters

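Some characters can be written as a base character followed by one or more combining marks:

é = e + \u0301 (combining acute accent)

Naive reversal separates the combining mark from its base:

"cafe\u0301" reversed naively → "\u0301efac" (the accent ends up at the start of the string, detached from its "e"; in a longer string it would attach to whatever character now precedes it)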
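The same visible word can be stored in precomposed or decomposed form, and only the decomposed form is damaged by a code point reversal. The sketch below compares the two; the helper name naive is illustrative. Normalizing to NFC first happens to work for "café", but it should be read as a partial workaround, since many base and mark combinations have no precomposed code point.

```javascript
const composed = "caf\u00E9";    // é stored as one precomposed code point
const decomposed = "cafe\u0301"; // e followed by U+0301 (combining acute accent)

// Code point reversal: safe for surrogate pairs, unaware of combining marks.
const naive = s => [...s].reverse().join("");

console.log(composed.length, decomposed.length);        // 4 5
console.log(composed === decomposed);                   // false
console.log(decomposed.normalize("NFC") === composed);  // true

// The precomposed form survives because é is a single code point.
console.log(naive(composed) === "\u00E9fac");           // true

// The decomposed form does not: the combining mark now leads the string.
console.log(naive(decomposed) === "\u0301efac");        // true

// Normalizing to NFC before reversing repairs this particular case only.
console.log(naive(decomposed.normalize("NFC")) === "\u00E9fac"); // true
```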

Grapheme Clusters

A grapheme cluster is what a user perceives as a single character. Some grapheme clusters consist of multiple code points:

  • Emoji with skin tone: 👋🏽 = 👋 + 🏽 (two code points)
  • Family emoji: 👨‍👩‍👧 = 👨 + ZWJ + 👩 + ZWJ + 👧 (five code points)
  • Flag emoji: 🇺🇸 = 🇺 + 🇸 (two regional indicator symbols)
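The three examples above can be measured at each level. This is a minimal sketch that assumes a runtime with Intl.Segmenter (recent versions of Node.js and the major browsers); the function name measure and the variable names are illustrative.

```javascript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Count the same text at three levels of granularity.
function measure(text) {
  return {
    codeUnits: text.length,                                 // UTF-16 code units
    codePoints: [...text].length,                           // Unicode code points
    graphemes: [...graphemeSegmenter.segment(text)].length, // user-perceived characters
  };
}

const wave = "\u{1F44B}\u{1F3FD}";                        // 👋🏽 hand + skin tone modifier
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 👨‍👩‍👧 joined with ZWJ (U+200D)
const flag = "\u{1F1FA}\u{1F1F8}";                        // 🇺🇸 regional indicators U + S

console.log(measure(wave));   // { codeUnits: 4, codePoints: 2, graphemes: 1 }
console.log(measure(family)); // { codeUnits: 8, codePoints: 5, graphemes: 1 }
console.log(measure(flag));   // { codeUnits: 4, codePoints: 2, graphemes: 1 }
```

In each case the user sees one character, so a correct reversal has to move the whole sequence as a unit.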

Safe Reversal Techniques

JavaScript (code point level):

const reversed = [...str].reverse().join("");

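This keeps surrogate pairs intact, because string iteration yields whole code points, but it still splits combining marks from their base characters and breaks multi-code-point emoji such as flags, skin-tone variants, and ZWJ sequences.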
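A short sketch of where code point reversal succeeds and where it fails. The helper name reverseByCodePoint and the sample variables are illustrative.

```javascript
const reverseByCodePoint = s => [...s].reverse().join("");

const usFlag = "\u{1F1FA}\u{1F1F8}";     // 🇺🇸 regional indicators U + S
const wavingHand = "\u{1F44B}\u{1F3FD}"; // 👋🏽 hand + skin tone modifier

// Surrogate pairs survive: the emoji is still one valid code point.
console.log(reverseByCodePoint("a\u{1F600}b") === "b\u{1F600}a"); // true

// The regional indicators are swapped to S + U, which is a different sequence.
console.log(reverseByCodePoint(usFlag) === "\u{1F1F8}\u{1F1FA}"); // true

// The skin tone modifier now comes before the hand instead of after it,
// so it no longer modifies the hand.
console.log(reverseByCodePoint(wavingHand) === "\u{1F3FD}\u{1F44B}"); // true
```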

JavaScript (grapheme cluster level):

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const segments = [...segmenter.segment(str)].map(s => s.segment);
const reversed = segments.reverse().join("");
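The segmenter approach can be wrapped in a small reusable function. This is a sketch, not a definitive implementation: reverseGraphemes and its locale parameter are hypothetical names, and it assumes Intl.Segmenter is available in the target runtime.

```javascript
// Hypothetical helper: reverse a string one grapheme cluster at a time.
function reverseGraphemes(text, locale = "en") {
  const clusterSegmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // Each item yielded by segment() exposes the cluster text as .segment
  const clusters = Array.from(clusterSegmenter.segment(text), part => part.segment);
  return clusters.reverse().join("");
}

// "café 🇺🇸" with the é in decomposed form (e + U+0301)
const sample = "cafe\u0301 \u{1F1FA}\u{1F1F8}";
const flipped = reverseGraphemes(sample);

// The flag and the accented e each move as a single unit.
console.log(flipped === "\u{1F1FA}\u{1F1F8} e\u0301fac"); // true

// A ZWJ family sequence between two letters stays intact as well.
const familySample = "a\u{1F468}\u200D\u{1F469}\u200D\u{1F467}b";
console.log(reverseGraphemes(familySample) === "b\u{1F468}\u200D\u{1F469}\u200D\u{1F467}a"); // true
```

Where Intl.Segmenter is missing, a feature check such as typeof Intl.Segmenter === "function" lets the caller fall back to code point reversal or to a polyfill.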

Python:

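# Requires the third-party "grapheme" package (pip install grapheme)
import grapheme
# graphemes() returns an iterator, so build a list before calling reversed()
reversed_str = "".join(reversed(list(grapheme.graphemes(original))))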

Summary

Level            | Handles Surrogate Pairs | Handles Combining Marks | Handles Complex Emoji
Code unit        | No                      | No                      | No
Code point       | Yes                     | No                      | No
Grapheme cluster | Yes                     | Yes                     | Yes

Use Case

Understanding Unicode-safe string reversal is critical for developers building internationalized applications, text processing tools, and any software that handles user-generated content including emoji. It is also a common topic in senior-level coding interviews.
