Unicode in JSON

Understand how JSON handles Unicode characters, encoding requirements, escape sequences, and common pitfalls with emoji, CJK characters, and surrogate pairs.

Detailed Explanation

JSON is defined as a Unicode-based format. The current specification (RFC 8259) requires that JSON text exchanged between systems be encoded in UTF-8. JSON strings can contain any Unicode character, either directly (as literal UTF-8 bytes) or via the \uXXXX escape syntax; the only characters that must always be escaped are the quotation mark, the backslash, and the control characters U+0000 through U+001F.

Encoding requirements:

RFC 8259 mandates that JSON exchanged between systems "MUST be encoded using UTF-8." Earlier specifications allowed UTF-16 and UTF-32, but the current standard strongly favors UTF-8. In practice, virtually all JSON on the web is UTF-8 encoded. If your JSON file is saved in a different encoding (like Latin-1 or Shift-JIS), parsers may produce incorrect results or throw errors on non-ASCII characters.
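As a quick illustration, Python's `json.loads` accepts raw bytes and assumes a UTF family encoding; bytes saved in Latin-1 fail to decode:

```python
import json

# A JSON document containing a non-ASCII character.
doc = '{"name": "café"}'

# UTF-8 bytes decode correctly.
parsed = json.loads(doc.encode("utf-8"))
print(parsed["name"])  # café

# The same text saved as Latin-1 is not valid UTF-8:
# the é becomes the lone byte 0xE9, which the parser rejects.
try:
    json.loads(doc.encode("latin-1"))
except UnicodeDecodeError as exc:
    print("decode failed:", exc)
```

The same mismatch in other languages may not raise an error at all and instead silently produce mojibake, which is harder to detect.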

Unicode escape sequences:

Any Unicode character can be represented using the \uXXXX escape, where XXXX is the four-digit hexadecimal code point. For example:

  • \u0041 = A
  • \u00E9 = é (Latin small letter e with acute)
  • \u4E16 = 世 (the Chinese character for "world")
  • \u2603 = ☃ (snowman)

A literal character and its escape sequence are semantically identical: "café" and "caf\u00E9" parse to the same string. Visual similarity, however, does not imply equality: "caf\u00E9" (precomposed é) and "cafe\u0301" (e followed by a combining acute accent) render alike but are different Unicode sequences (precomposed vs. combining characters).
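Both points can be verified with any JSON parser; in Python, for instance:

```python
import json

# An escape sequence and the literal character parse to the same string.
assert json.loads('"caf\\u00E9"') == json.loads('"café"') == "café"

# But precomposed and combining forms are distinct sequences,
# even though they render identically.
precomposed = json.loads('"caf\\u00E9"')  # c a f é    (4 code points)
combining = json.loads('"cafe\\u0301"')   # c a f e ◌́  (5 code points)
assert precomposed != combining
assert len(precomposed) == 4 and len(combining) == 5
```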

Emoji and surrogate pairs:

Characters outside the Basic Multilingual Plane (BMP), including most emoji, have code points above U+FFFF and cannot be represented with a single \uXXXX escape. Instead, they require a UTF-16 surrogate pair — two consecutive \u escapes:

{"emoji": "\uD83D\uDE00"}

This represents U+1F600 (grinning face). Modern JSON parsers handle surrogate pairs automatically, but some older or simpler parsers may not, producing garbled output or errors. When writing JSON by hand, it is simpler and clearer to include the literal emoji character: {"emoji": "😀"}.
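A short Python sketch showing both directions, pair to code point and back:

```python
import json

# The surrogate pair \uD83D\uDE00 decodes to a single code point, U+1F600.
data = json.loads('{"emoji": "\\uD83D\\uDE00"}')
assert data["emoji"] == "\U0001F600"  # 😀

# When serializing, Python escapes non-ASCII by default,
# re-emitting the surrogate pair (in lowercase hex).
print(json.dumps({"emoji": "\U0001F600"}))
# {"emoji": "\ud83d\ude00"}
```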

Control characters:

JSON strings must not contain literal control characters (U+0000 through U+001F). These must be escaped: \n for newline, \t for tab, \r for carriage return, and \uXXXX for other control characters. Attempting to include a literal null byte (U+0000) in a JSON string is invalid and will cause a parse error.
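For example, Python's parser rejects a raw control character in strict mode, while its serializer always emits the escaped form:

```python
import json

# A raw newline inside a JSON string is invalid...
try:
    json.loads('"line one\nline two"')
except json.JSONDecodeError as exc:
    print("rejected:", exc)

# ...so serializers emit the escaped form instead.
assert json.dumps("line one\nline two") == '"line one\\nline two"'

# Control characters without a short escape fall back to \uXXXX.
assert json.dumps("\x01") == '"\\u0001"'
```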

Common mistakes developers make:

The most common mistake is saving JSON files in the wrong encoding, especially on Windows where text editors may default to Windows-1252. Another mistake is generating escape sequences for characters that do not need them — while \u escapes are always valid, they reduce readability without benefit. Some developers incorrectly split surrogate pairs across string boundaries or process the two halves independently, producing invalid Unicode. Another issue is not normalizing Unicode strings before comparison: "cafe\u0301" (e + combining accent) and "caf\u00E9" (precomposed) look the same when rendered but are different byte sequences.
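Two of these mistakes can be demonstrated in a few lines of Python. Its json module happens to tolerate a lone surrogate half, so the error surfaces later, when encoding; and normalization (here via the standard unicodedata module) is needed before comparing the two forms of "café":

```python
import json
import unicodedata

# Mistake 1: a split surrogate pair. This parser accepts the lone half...
half = json.loads('"\\uD83D"')

# ...but the result is not valid Unicode and cannot be
# encoded as UTF-8, which fails only later in the pipeline.
try:
    half.encode("utf-8")
except UnicodeEncodeError as exc:
    print("invalid:", exc)

# Mistake 2: comparing without normalizing.
a = json.loads('"caf\\u00E9"')   # precomposed
b = json.loads('"cafe\\u0301"')  # combining
assert a != b
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Note that parsers differ here: some reject lone surrogates outright, which is arguably the safer behavior.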

Best practices:

Always save JSON files as UTF-8 without BOM. Use literal Unicode characters instead of escape sequences for readability. Only use \uXXXX escapes for control characters or when your transmission channel does not support UTF-8. Be aware of Unicode normalization when comparing strings. Test your application with non-ASCII input including CJK characters, emoji, and right-to-left scripts like Arabic and Hebrew.
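The "literal characters over escapes" practice maps to a serializer option in most languages; in Python it is ensure_ascii=False:

```python
import json

data = {"name": "café", "rating": "😀😀😀"}

# Default: every non-ASCII character is escaped (valid but hard to read).
print(json.dumps(data))

# ensure_ascii=False emits literal UTF-8, per the practice above.
print(json.dumps(data, ensure_ascii=False))
# {"name": "café", "rating": "😀😀😀"}
```

When writing to a file, combine this with an explicit encoding (e.g. open(path, "w", encoding="utf-8")) so the platform default cannot reintroduce the wrong-encoding problem.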

Use Case

Correctly encoding a multilingual product catalog in JSON that includes Chinese product names, Arabic descriptions, and emoji ratings without encoding errors.
