XML Encoding Declaration — UTF-8, UTF-16, and Character Sets
Understand the XML encoding declaration in the XML prolog. Learn about UTF-8, UTF-16, ISO-8859-1, how encoding affects parsing, and how to fix encoding mismatch errors.
Detailed Explanation
XML Encoding Declaration
The XML prolog's encoding declaration tells the parser which character encoding is used in the document. Getting this wrong leads to garbled text, parse failures, and data corruption.
The XML Prolog
<?xml version="1.0" encoding="UTF-8"?>
<root>
<message>Hello, world!</message>
</root>
The encoding attribute names the character encoding of the document. If it is omitted, the parser assumes UTF-8, unless the file begins with a UTF-16 byte order mark (BOM), in which case it is read as UTF-16.
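The defaulting rule above can be sketched in Python. This is a minimal illustration, not how any particular parser implements it: the function name declared_encoding is hypothetical, and the sketch assumes an ASCII-compatible encoding whenever no UTF-16 BOM is present (it does not handle UTF-32 or EBCDIC).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very start of the data.
_DECL_RE = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the prolog, or the default when none is declared."""
    # A UTF-16 BOM (FF FE or FE FF) means UTF-16, regardless of any declaration.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"
    # Skip an optional UTF-8 BOM before looking for the declaration.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _DECL_RE.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No BOM and no encoding attribute: UTF-8 is assumed.
    return "UTF-8"
```

For example, declared_encoding(b'<?xml version="1.0"?><root/>') falls through to the UTF-8 default because the declaration has no encoding attribute.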
Common Encodings
| Encoding | Description | Use Case |
|---|---|---|
| UTF-8 | Variable-width, ASCII-compatible | Default for modern XML, web content |
| UTF-16 | Variable-width (2 bytes, or 4 bytes as a surrogate pair) | Windows APIs, Java internal strings |
| ISO-8859-1 | Western European, single-byte | Legacy European systems |
| Shift_JIS | Japanese, variable-width | Legacy Japanese systems |
| Windows-1252 | Windows Western European superset | Legacy Windows applications |
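To make the width differences in the table concrete, the following sketch measures how many bytes a few sample characters occupy in each encoding. The helper name encoded_length and the sample characters are illustrative choices; the codec names are the ones Python's standard library uses.

```python
def encoded_length(ch: str, encoding: str):
    """Return the number of bytes `ch` needs in `encoding`, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

# "utf-16-le" is used instead of "utf-16" so that Python does not prepend a BOM.
encodings = ["utf-8", "utf-16-le", "iso-8859-1", "cp1252", "shift_jis"]
samples = ["A", "é", "€", "日", "😀"]

for ch in samples:
    widths = {enc: encoded_length(ch, enc) for enc in encodings}
    print(repr(ch), widths)
```

The euro sign shows the difference between the two Western European encodings: it has a single-byte code in Windows-1252 but cannot be encoded in ISO-8859-1 at all.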
Encoding Mismatch Errors
The most common issue is declaring one encoding but saving the file in another:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- The declaration above says ISO-8859-1, but the file is saved as UTF-8 -->
This causes the parser to misinterpret multi-byte UTF-8 sequences as ISO-8859-1 characters, producing garbled output (mojibake) for non-ASCII text.
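The effect can be reproduced with Python's standard xml.etree.ElementTree parser. This is a small sketch with an invented sample document: the content is encoded as UTF-8 while the declaration claims ISO-8859-1.

```python
import xml.etree.ElementTree as ET

original = "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9': "é" becomes two bytes

# The content is UTF-8, but the declaration tells the parser it is ISO-8859-1.
document = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?><message>'
    + utf8_bytes
    + b"</message>"
)

# The parser trusts the declaration and reads each byte as one ISO-8859-1 character.
parsed_text = ET.fromstring(document).text  # "cafÃ©"

# Reversing the wrong decoding step recovers the original text.
repaired = parsed_text.encode("iso-8859-1").decode("utf-8")
```

Note that the parser raises no error here, because every byte value is a valid ISO-8859-1 character. The corruption is silent, which is why it often goes unnoticed until the text is displayed.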
How to Fix Encoding Issues
- Check the actual file encoding using a hex editor or the file command
- Match the declaration to the actual encoding
- Convert to UTF-8 using a tool like iconv if needed: iconv -f ISO-8859-1 -t UTF-8 input.xml > output.xml
- Update the declaration to encoding="UTF-8"
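The last two steps, converting the bytes and updating the declaration, can also be done together in a short script. The sketch below is one possible approach, not a replacement for iconv: the function name convert_to_utf8 is hypothetical, the source encoding must be known in advance, and the input is assumed to have no BOM.

```python
import re

# Finds the encoding pseudo-attribute inside an XML declaration at the start of the text.
_ENCODING_ATTR_RE = re.compile(r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')

def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode an XML document as UTF-8 and update its declaration to match."""
    text = data.decode(source_encoding)
    # Rewrite encoding="..." to encoding="UTF-8", keeping the original quote style.
    text = _ENCODING_ATTR_RE.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")

# Usage with the file names from the iconv example above:
# with open("input.xml", "rb") as src:
#     converted = convert_to_utf8(src.read(), "iso-8859-1")
# with open("output.xml", "wb") as dst:
#     dst.write(converted)
```

Working on bytes and decoding explicitly avoids relying on the platform's default text encoding, which is a common source of the mismatch in the first place.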
BOM (Byte Order Mark)
UTF-8 files may include a BOM (EF BB BF) at the start. While valid, it can cause issues with some parsers and tools that do not expect it. UTF-16 files require a BOM to indicate byte order: FE FF for big-endian, FF FE for little-endian.
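Checking for a BOM only requires comparing the first few bytes of the file against known signatures. The sketch below uses the BOM constants from Python's codecs module; the function names detect_bom and strip_utf8_bom are hypothetical, and UTF-32 signatures are deliberately left out.

```python
import codecs

# Known signatures, as (BOM bytes, encoding name) pairs.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def detect_bom(data: bytes):
    """Return (encoding name, BOM length) for a leading BOM, or (None, 0) if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM, for tools that cannot cope with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only the UTF-8 BOM is stripped here, since removing a UTF-16 BOM would discard the byte-order information the parser needs.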
Best Practice
Always use UTF-8 for new XML documents. It is the default when no encoding is declared, every conforming XML parser is required to support it, and it can represent every Unicode character while keeping ASCII text at one byte per character.
Use Case
Understanding XML encoding is critical when processing XML files from international sources, migrating legacy systems that use non-UTF-8 encodings, handling XML data feeds that contain characters from multiple languages, and debugging encoding errors in CI/CD pipelines or API integrations.