Character Encoding in XML and JSON Conversion
Handle character encoding differences between XML and JSON. Covers UTF-8, UTF-16, XML entity references, and ensuring correct character preservation during conversion.
Detailed Explanation
Character encoding is handled differently in XML and JSON, and mismatches can cause data corruption during conversion.
XML encoding flexibility:
XML supports multiple character encodings, declared in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="UTF-16"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml version="1.0" encoding="Shift_JIS"?>
JSON encoding is effectively fixed to UTF-8. Per RFC 8259, JSON text exchanged between systems that are not part of a closed ecosystem must be encoded in UTF-8. JSON has no encoding declaration, so none is needed or possible.
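As a minimal sketch of the difference, the snippet below parses a small ISO-8859-1 document and writes the result as UTF-8 JSON. It assumes Python's standard `xml.etree.ElementTree` and `json` modules; the sample document and variable names are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# XML bytes in ISO-8859-1: "é" is the single byte 0xE9 in this encoding.
xml_bytes = '<?xml version="1.0" encoding="ISO-8859-1"?><text>Café</text>'.encode("iso-8859-1")

# Passing bytes (not str) lets the parser read the encoding declaration itself.
root = ET.fromstring(xml_bytes)

# ensure_ascii=False keeps "é" as a literal character; the JSON is then encoded as UTF-8.
json_bytes = json.dumps({"text": root.text}, ensure_ascii=False).encode("utf-8")

print(json_bytes)  # b'{"text": "Caf\xc3\xa9"}'
```

The same character is one byte (0xE9) in the XML input and two bytes (0xC3 0xA9) in the JSON output, which is why the bytes cannot simply be copied across.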
XML character references:
XML can represent characters using numeric references:
<text>Caf&#233; &#x2014; a nice place</text>
<!-- &#233; = e with acute (decimal) -->
<!-- &#x2014; = em dash (hexadecimal) -->
In JSON, these become their actual Unicode characters, which can be written with \u escapes:
{ "text": "Caf\u00e9 \u2014 a nice place" }
Or, since JSON supports UTF-8 directly:
{ "text": "Café — a nice place" }
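A short sketch of this decoding step, assuming Python's standard library parser: the numeric references are resolved while parsing, and the JSON serializer then decides whether to emit \u escapes or literal UTF-8 characters. The variable names are illustrative only.

```python
import json
import xml.etree.ElementTree as ET

# The parser resolves decimal and hexadecimal character references on its own.
root = ET.fromstring("<text>Caf&#233; &#x2014; a nice place</text>")
text = root.text

# Default json.dumps output is pure ASCII, using \u escapes for non-ASCII characters.
escaped = json.dumps({"text": text})

# ensure_ascii=False writes the characters directly instead.
literal = json.dumps({"text": text}, ensure_ascii=False)

print(escaped)  # {"text": "Caf\u00e9 \u2014 a nice place"}
print(literal)  # {"text": "Café — a nice place"}
```

Both forms decode to the same string, so the choice between them is about transport and readability rather than correctness.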
XML predefined entities:
XML has five predefined entity references, used to escape characters that would otherwise be read as markup in element content and attribute values:
| Entity | Character | JSON equivalent |
|---|---|---|
| &lt; | < | < (direct) |
| &gt; | > | > (direct) |
| &amp; | & | & (direct) |
| &quot; | " | \" (escaped) |
| &apos; | ' | ' (direct) |
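The mapping in the table can be checked with a small sketch, again assuming Python's standard `xml.etree.ElementTree` and `json` modules and a made-up sample element. The parser decodes all five entities, and only the double quote needs escaping on the JSON side.

```python
import json
import xml.etree.ElementTree as ET

xml_text = "<note>1 &lt; 2 &amp;&amp; 3 &gt; 2, &quot;quoted&quot; &apos;single&apos;</note>"

# Parsing decodes every predefined entity to its real character.
decoded = ET.fromstring(xml_text).text

# JSON leaves <, >, & and ' as they are and backslash-escapes only the double quote.
json_text = json.dumps({"note": decoded})

print(decoded)
print(json_text)
```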
Conversion considerations:
- XML to JSON: Decode all entity references and character references to their actual characters. The JSON output should contain the real characters, not the XML escape sequences.
- JSON to XML: Escape special characters using either entity references or CDATA sections. At minimum, < and & must be escaped in text content; > is usually escaped as well, and must be when it appears in the sequence ]]>.
- Non-BMP characters (emoji, CJK extensions) are valid in both formats but may need surrogate pairs in JSON (\uD83D\uDE00 for a smiley) or numeric references in XML (&#x1F600;).
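The JSON-to-XML direction and the non-BMP case can be sketched as follows. This assumes Python's standard library; the sample payload is hypothetical. `ET.tostring` with its default settings produces ASCII bytes, so it escapes markup characters and falls back to numeric references for anything outside ASCII.

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# JSON input using a surrogate pair escape for U+1F600 (a non-BMP character).
data = json.loads(r'{"text": "a < b & c \uD83D\uDE00"}')
value = data["text"]  # the surrogate pair is combined into one code point

# Manual escaping for text content: handles &, < and >.
escaped_text = escape(value)

# Letting the serializer do the work: the default output is ASCII bytes.
elem = ET.Element("text")
elem.text = value
xml_bytes = ET.tostring(elem)

print(escaped_text)
print(xml_bytes)
```

Building the element tree and serializing it is usually less error-prone than concatenating escaped strings by hand, since the serializer applies the escaping rules consistently.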
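Best practice: Always normalize to UTF-8 during conversion. If the source XML uses a non-UTF-8 encoding, decode it according to its declared encoding before producing JSON, either by letting the parser honor the XML declaration or by transcoding the bytes to UTF-8 and updating or removing the declaration so that it no longer names the old encoding.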
Use Case
Converting XML documents from a Japanese legacy system that uses Shift_JIS encoding into UTF-8 JSON for a modern web application, ensuring all kanji characters are preserved correctly.
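A minimal sketch of this use case, under several assumptions: the source encoding is known in advance and passed in explicitly, the document is small enough to decode in memory, and the helper name `xml_bytes_to_json` is hypothetical. The bytes are decoded with Python's `shift_jis` codec, the now-stale XML declaration is stripped, and the result is serialized as UTF-8 JSON.

```python
import json
import re
import xml.etree.ElementTree as ET

def xml_bytes_to_json(raw: bytes, source_encoding: str) -> bytes:
    """Decode legacy XML bytes and return UTF-8 encoded JSON (illustrative helper)."""
    # Transcode: bytes in the legacy encoding become a Python (Unicode) string.
    text = raw.decode(source_encoding)
    # Drop the XML declaration, which still names the old encoding.
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    root = ET.fromstring(text)
    # ensure_ascii=False keeps the kanji as literal characters in the output.
    return json.dumps({root.tag: root.text}, ensure_ascii=False).encode("utf-8")

legacy = '<?xml version="1.0" encoding="Shift_JIS"?><name>漢字テスト</name>'.encode("shift_jis")
result = xml_bytes_to_json(legacy, "shift_jis")

print(result.decode("utf-8"))  # {"name": "漢字テスト"}
```

Decoding explicitly is a design choice here: not every XML parser supports multi-byte legacy encodings natively, so handing it an already-decoded string avoids depending on that support.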