XML Encoding Declaration — UTF-8, UTF-16, and Character Sets

Understand the XML encoding declaration in the XML prolog. Learn about UTF-8, UTF-16, ISO-8859-1, how encoding affects parsing, and how to fix encoding mismatch errors.

Detailed Explanation

XML Encoding Declaration

The XML prolog's encoding declaration tells the parser which character encoding is used in the document. Getting this wrong leads to garbled text, parse failures, and data corruption.

The XML Prolog

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <message>Hello, world!</message>
</root>

The encoding attribute names the character encoding of the document. If it is omitted, the parser assumes UTF-8, unless the file begins with a UTF-16 byte order mark (BOM), in which case it reads the file as UTF-16.
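As a rough illustration, the sketch below uses Python's standard xml.etree.ElementTree module to show a parser choosing its decoder from the declaration when it is handed raw bytes. The sample documents and variable names are invented for this example.

```python
import xml.etree.ElementTree as ET

# The same text in two byte encodings. Both samples are illustrative.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<message>caf\u00e9</message>"
).encode("iso-8859-1")
no_decl_doc = "<message>caf\u00e9</message>".encode("utf-8")

# Passing bytes (not str) lets the parser honor the declaration,
# or fall back to UTF-8 when there is no declaration.
from_latin1 = ET.fromstring(latin1_doc).text
from_default = ET.fromstring(no_decl_doc).text

print(from_latin1 == from_default)
```

Both calls should recover the same string, because in each case the bytes agree with the encoding the parser ends up using.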

Common Encodings

Encoding     | Description                                                | Use Case
UTF-8        | Variable-width (1 to 4 bytes), ASCII-compatible            | Default for modern XML, web content
UTF-16       | Variable-width (2 bytes, or 4 bytes via surrogate pairs)   | Windows APIs, Java internal strings
ISO-8859-1   | Western European, single-byte                              | Legacy European systems
Shift_JIS    | Japanese, variable-width (1 or 2 bytes)                    | Legacy Japanese systems
Windows-1252 | Western European, single-byte; Windows extension of ISO-8859-1 | Legacy Windows applications

Encoding Mismatch Errors

The most common issue is declaring one encoding but saving the file in another:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- The file itself is saved as UTF-8, not ISO-8859-1 -->

This causes the parser to misinterpret multi-byte UTF-8 sequences as ISO-8859-1 characters, producing garbled output (mojibake) for non-ASCII text.
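The effect can be reproduced in a few lines. The following is a minimal Python sketch using the standard xml.etree.ElementTree module; the sample text and variable names are assumptions made for illustration. It also covers the reverse mismatch, where ISO-8859-1 bytes are declared as UTF-8 and the parser rejects the document instead of garbling it.

```python
import xml.etree.ElementTree as ET

utf8_bytes = "caf\u00e9".encode("utf-8")         # e-acute becomes C3 A9
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # e-acute becomes E9

# Case 1: UTF-8 bytes under an ISO-8859-1 declaration parse, but garble.
wrong_decl = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?><m>' + utf8_bytes + b"</m>"
)
garbled = ET.fromstring(wrong_decl).text  # each byte read as one character

# Case 2: ISO-8859-1 bytes under a UTF-8 declaration are not valid UTF-8.
bad_utf8 = (
    b'<?xml version="1.0" encoding="UTF-8"?><m>' + latin1_bytes + b"</m>"
)
try:
    ET.fromstring(bad_utf8)
    parse_failed = False
except ET.ParseError:
    parse_failed = True

print(repr(garbled), parse_failed)
```

In case 1 the two bytes C3 A9 are decoded as two separate ISO-8859-1 characters, which is the classic mojibake pattern.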

How to Fix Encoding Issues

  1. Check the actual file encoding using a hex editor or the file command
  2. Match the declaration to the actual encoding
  3. Convert to UTF-8 with a tool like iconv if needed, passing the actual encoding from step 1 to -f:
    iconv -f ISO-8859-1 -t UTF-8 input.xml > output.xml
  4. Update the declaration to encoding="UTF-8"
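Steps 3 and 4 can also be done together in a script. The sketch below is one possible approach in Python, not a drop-in tool: the function name convert_to_utf8 and the regular expression are hypothetical, and the caller is assumed to already know the file's real encoding from step 1.

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
# Hypothetical helper pattern written for this example.
_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode XML bytes to UTF-8 and rewrite the declared encoding.

    source_encoding must be the encoding the bytes are really in,
    which is not necessarily what the declaration claims.
    """
    text = raw.decode(source_encoding)
    # Rewrite only the first match, which is the XML declaration.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")

legacy = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><m>caf\u00e9</m>'
).encode("iso-8859-1")
converted = convert_to_utf8(legacy, "iso-8859-1")
print(converted)
```

Decoding with the real source encoding before re-encoding is what keeps the bytes and the declaration in agreement. Rewriting the declaration alone would only move the mismatch.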

BOM (Byte Order Mark)

UTF-8 files may begin with a BOM (EF BB BF). The XML specification allows it, but some parsers and text-processing tools mishandle it. Files in the UTF-16 encoding must begin with a BOM (FE FF for big-endian, FF FE for little-endian) so the parser can determine the byte order.
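A BOM check is easy to script. The following Python sketch uses the BOM constants from the standard codecs module; the function name sniff_bom is hypothetical, and the sketch deliberately covers only the UTF-8 and UTF-16 marks discussed here.

```python
import codecs

# BOMs covered by this sketch. UTF-32 is intentionally not handled;
# its little-endian BOM starts with the same two bytes as UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def sniff_bom(raw: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0

with_bom = codecs.BOM_UTF8 + b"<root/>"
name, bom_len = sniff_bom(with_bom)
stripped = with_bom[bom_len:]  # bytes with the BOM removed
print(name, stripped)
```

Stripping a UTF-8 BOM this way is one option when a downstream tool cannot cope with it. A UTF-16 BOM should stay in place, since the parser relies on it.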

Best Practice

Use UTF-8 for new XML documents. Every conforming XML parser must support it, it is the default when no encoding is declared, and it can represent every Unicode character while staying byte-compatible with ASCII.
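For instance, when generating XML from Python, the output encoding and the declaration can be produced in one call so they cannot drift apart. This is a small sketch using the standard xml.etree.ElementTree module; the element names and sample text are placeholders, and the xml_declaration argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

original = "Gr\u00fc\u00dfe, \u4e16\u754c"  # German and Chinese sample text

root = ET.Element("root")
ET.SubElement(root, "message").text = original

# encoding="utf-8" returns bytes; xml_declaration=True (Python 3.8+)
# prepends a declaration that names the same encoding.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Parsing the bytes back should recover the original text unchanged.
roundtrip = ET.fromstring(data).find("message").text
print(roundtrip == original)
```

Writing the returned bytes to disk in binary mode keeps them exactly as produced, which avoids a second, accidental re-encoding by the file layer.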

Use Case

Understanding XML encoding is critical when processing XML files from international sources, migrating legacy systems that use non-UTF-8 encodings, handling XML data feeds that contain characters from multiple languages, and debugging encoding errors in CI/CD pipelines or API integrations.
