XML Encoding Declaration — UTF-8, UTF-16, and Character Sets


Understand the XML encoding declaration in the XML prolog. Learn about UTF-8, UTF-16, ISO-8859-1, how encoding affects parsing, and how to fix encoding mismatch errors.


Detailed Explanation

XML Encoding Declaration

The XML prolog's encoding declaration tells the parser which character encoding is used in the document. Getting this wrong leads to garbled text, parse failures, and data corruption.

The XML Prolog

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <message>Hello, world!</message>
</root>

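The encoding pseudo-attribute names the character encoding of the document's bytes. If it is omitted and nothing external (such as an HTTP header) specifies an encoding, the parser assumes UTF-8, or UTF-16 if the document starts with a UTF-16 byte order mark (BOM).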
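As a rough illustration of how the declared encoding can be read before the document is decoded, here is a minimal Python sketch. The helper name declared_encoding is hypothetical, and the sketch assumes the bytes are in an ASCII-compatible encoding (so it does not handle UTF-16 input).

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
# Assumption: the document is in an ASCII-compatible encoding.
_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    # Skip a UTF-8 BOM if one precedes the declaration.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _DECL_RE.match(data)
    return match.group(1).decode("ascii") if match else default

print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'))
print(declared_encoding(b'<?xml version="1.0"?><root/>'))
```

A real parser does this detection itself; a helper like this is mainly useful for diagnostics, for example to compare the declared encoding with what a file actually contains.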

Common Encodings

Encoding	Description	Use Case
UTF-8	Variable-width, ASCII-compatible	Default for modern XML, web content
UTF-16	Variable-width (2 bytes, or 4 bytes via surrogate pairs)	Windows APIs, Java internal strings
ISO-8859-1	Western European, single-byte	Legacy European systems
Shift_JIS	Japanese, variable-width	Legacy Japanese systems
Windows-1252	Western European, single-byte; replaces the ISO-8859-1 control range 0x80-0x9F with printable characters such as €	Legacy Windows applications
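To make the width differences in the table concrete, the following Python sketch prints how many bytes a few sample characters occupy in each encoding. The sample characters and the helper name encoded_length are illustrative choices, and the codec names are Python's aliases for the encodings above.

```python
samples = ["A", "é", "€", "あ", "😀"]
encodings = ["utf-8", "utf-16-le", "iso-8859-1", "shift_jis", "cp1252"]

def encoded_length(char, encoding):
    """Return the byte length of char in encoding, or None if it cannot be encoded."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    # None means the character does not exist in that encoding.
    print(ch, {enc: encoded_length(ch, enc) for enc in encodings})
```

Characters outside an encoding's repertoire (for example € in ISO-8859-1) show up as None, which is one reason the legacy encodings are a poor fit for multilingual data.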

Encoding Mismatch Errors

The most common issue is declaring one encoding but saving the file in another:

<!-- File saved as UTF-8 but declares: -->
<?xml version="1.0" encoding="ISO-8859-1"?>

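This causes the parser to read each byte of a multi-byte UTF-8 sequence as a separate ISO-8859-1 character, producing garbled output (mojibake) for non-ASCII text. The reverse mismatch, ISO-8859-1 bytes declared as UTF-8, usually fails outright with a parse error, because a byte such as E9 is not valid UTF-8 on its own.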
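The mismatch can be reproduced with a short Python sketch using the standard library's xml.etree.ElementTree. The sample documents are made up for illustration; the last case shows the opposite mismatch (ISO-8859-1 bytes declared as UTF-8), which typically surfaces as a parse error instead of garbled text.

```python
import xml.etree.ElementTree as ET

# "café" stored as UTF-8 bytes (é = C3 A9), but declared as ISO-8859-1.
mislabeled = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<message>caf\xc3\xa9</message>"
)
garbled = ET.fromstring(mislabeled).text
print(repr(garbled))  # each UTF-8 byte is read as its own Latin-1 character

# The same bytes with a declaration that matches the real encoding.
correct = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<message>caf\xc3\xa9</message>"
)
clean = ET.fromstring(correct).text
print(repr(clean))

# Opposite mismatch: ISO-8859-1 bytes (é = E9) declared as UTF-8.
reverse = b'<?xml version="1.0" encoding="UTF-8"?><message>caf\xe9</message>'
try:
    ET.fromstring(reverse)
    reverse_failed = False
except ET.ParseError as exc:
    reverse_failed = True
    print("parse error:", exc)
```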

How to Fix Encoding Issues

1. Check the actual file encoding using a hex editor or the file command.
2. Match the declaration to the actual encoding.
3. Convert to UTF-8 using a tool like iconv if needed:

iconv -f ISO-8859-1 -t UTF-8 input.xml > output.xml

4. Update the declaration to encoding="UTF-8".
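Where iconv is not available, the convert-and-update steps can be sketched in Python. This is a minimal illustration under stated assumptions: the function name convert_to_utf8 is hypothetical, the source encoding must already be known, and the regular expression only rewrites a declaration at the very start of the document.

```python
import re

def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode XML bytes as UTF-8 and update the declaration to match."""
    text = data.decode(source_encoding)
    # Rewrite only the encoding pseudo-attribute in the XML declaration.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2',
        r'\1\2UTF-8\2',
        text,
        count=1,
    )
    return text.encode("utf-8")

legacy = b'<?xml version="1.0" encoding="ISO-8859-1"?><message>caf\xe9</message>'
print(convert_to_utf8(legacy, "iso-8859-1"))
```

Doing both steps in one function avoids the intermediate state where the bytes are already UTF-8 but the declaration still names the old encoding.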

BOM (Byte Order Mark)

UTF-8 files may include a BOM (EF BB BF) at the start. While valid, it can cause issues with some XML parsers and text-processing tools. The XML specification requires UTF-16 documents to begin with a BOM, which indicates byte order: FE FF for big-endian, FF FE for little-endian.
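A BOM check is easy to sketch in Python with the constants from the standard codecs module. The function name detect_bom and its return shape are illustrative choices, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16LE.

```python
import codecs

# BOM byte sequences paired with the encoding each one signals.
# Assumption: UTF-32 is out of scope for this sketch.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(detect_bom(b"\xef\xbb\xbf<root/>"))
print(detect_bom(b"<root/>"))
```

Slicing the input with the returned length (data[bom_length:]) gives the document without its BOM, which can help when a downstream tool rejects a UTF-8 BOM.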

Best Practice

Use UTF-8 for new XML documents. It is the default encoding, every conforming XML parser is required to support it, and it can represent every Unicode character while staying compact for ASCII-heavy markup.
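As one possible way to follow this advice when generating XML, the sketch below writes a small document as UTF-8 with an explicit declaration using Python's xml.etree.ElementTree. The element names and text are made up, and the xml_declaration argument of ET.tostring assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
message = ET.SubElement(root, "message")
message.text = "café ☕"

# xml_declaration=True writes the prolog even though UTF-8 is the default.
xml_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```

Because the serializer both encodes the bytes and writes the declaration, the two cannot drift apart the way they can when a file is edited and re-saved by hand.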

Use Case

Understanding XML encoding is critical when processing XML files from international sources, migrating legacy systems that use non-UTF-8 encodings, handling XML data feeds that contain characters from multiple languages, and debugging encoding errors in CI/CD pipelines or API integrations.
