Byte Order Mark (BOM) and Encoding Markers in Unicode

Understand the Byte Order Mark (U+FEFF), its role in UTF-8/UTF-16/UTF-32 detection, its 3-byte UTF-8 representation, and how to identify and handle it in files.

Detailed Explanation

Byte Order Mark (BOM)

The Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the beginning of a text file to indicate its encoding and byte order. While useful for encoding detection, it frequently causes issues in data processing.

BOM in Different Encodings

| Encoding | BOM Bytes | Byte Order |
|----------|-----------|------------|
| UTF-8 | EF BB BF | N/A (UTF-8 has no byte order issue) |
| UTF-16 BE | FE FF | Big-endian |
| UTF-16 LE | FF FE | Little-endian |
| UTF-32 BE | 00 00 FE FF | Big-endian |
| UTF-32 LE | FF FE 00 00 | Little-endian |
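
The byte signatures in the table can be checked mechanically. The following is a minimal sketch, not a complete encoding detector: the function name detect_bom is hypothetical, and it only inspects the leading bytes. It uses the BOM constants from Python's standard codecs module. Note that the UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE signature (FF FE), so the 4-byte signatures are tested first.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE
# BOM (FF FE), so the longer UTF-32 signatures must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0) if none."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

One caveat of this approach: a UTF-16 LE file whose first character after the BOM is U+0000 would be reported as UTF-32 LE. That case is rare in real text, but it means the result should be treated as a strong hint rather than proof.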

UTF-8 BOM

The UTF-8 BOM (EF BB BF) is technically unnecessary because UTF-8 does not have byte order ambiguity; it serves only as an encoding signature. However, some applications (notably older versions of Windows Notepad, and Excel when exporting CSV as UTF-8) add it to UTF-8 files. This 3-byte sequence at the start of a file can cause problems:

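Shell scripts: A BOM before #!/bin/bash breaks the shebang, because the file no longer starts with the bytes "#!"
JSON/YAML parsing: Some parsers reject files that start with a BOM
CSV imports: The BOM may appear as a garbage character in the first field or header name (for example as "ï»¿" when the bytes are misread as Windows-1252)
HTTP responses: A BOM before the HTML can trigger quirks mode
Concatenation: Concatenating BOM-prefixed files leaves a BOM in the middle of the result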
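
Three of the problems above (CSV, JSON, and concatenation) can be reproduced with Python's standard library. This is an illustrative sketch; the sample data and variable names are made up for the demonstration, and other parsers may behave differently.

```python
import csv
import io
import json

BOM = "\ufeff"

# CSV: the BOM becomes part of the first header name.
reader = csv.DictReader(io.StringIO(BOM + "name,age\nAda,36\n"))
first_header = reader.fieldnames[0]  # "\ufeffname" rather than "name"

# JSON: Python's json module refuses a str that starts with U+FEFF.
try:
    json.loads(BOM + '{"ok": true}')
    json_rejected = False
except json.JSONDecodeError:
    json_rejected = True

# Concatenation: joining two BOM-prefixed texts leaves a BOM mid-stream.
part_a = BOM + "first\n"
part_b = BOM + "second\n"
combined = part_a + part_b
stray_bom_index = combined.index(BOM, 1)  # position of the second BOM
```

Because the character is invisible, a lookup such as row["name"] fails with a KeyError even though the header looks correct when printed.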

Detection with the Unicode Inspector

Paste suspicious text into the Unicode Inspector. If a BOM is present, you will see U+FEFF as the first character with the name "BYTE ORDER MARK" and category "Other". Its 3-byte UTF-8 encoding (EF BB BF) is clearly displayed.
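
A similar check can be scripted. The sketch below is not the Unicode Inspector's implementation; describe_first_char is a hypothetical helper built on Python's standard unicodedata module, and it assumes the text is non-empty. Python reports the character's formal Unicode name, ZERO WIDTH NO-BREAK SPACE (BYTE ORDER MARK is an alias of the same code point), and the two-letter general category Cf (Format), which belongs to the "Other" group.

```python
import unicodedata


def describe_first_char(text: str) -> dict:
    """Report code point, name, category, and UTF-8 bytes of the first character.

    Assumes `text` is non-empty.
    """
    ch = text[0]
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "utf8_bytes": ch.encode("utf-8").hex(" ").upper(),
    }


info = describe_first_char("\ufeffhello")
has_bom = info["code_point"] == "U+FEFF"
```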

Historical Note

U+FEFF originally served a dual purpose: as a BOM at file start and as ZERO WIDTH NO-BREAK SPACE (ZWNBSP) within text. Unicode 3.2 deprecated the ZWNBSP use and introduced U+2060 (WORD JOINER) as its replacement. When U+FEFF appears anywhere other than the beginning of a file, it should be treated as a legacy ZWNBSP.

Removing the BOM

In most programming languages, stripping the BOM is straightforward:

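JavaScript: text.replace(/^\uFEFF/, '') on the decoded string
Python: open(path, encoding='utf-8-sig'), which decodes UTF-8 and drops a leading BOM if one is present
Command line (GNU sed): sed '1s/^\xEF\xBB\xBF//' file.txt (writes the result to standard output; add -i to edit the file in place)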
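
For the Python case, here is a minimal sketch showing both a byte-level and a text-level approach. The helper names strip_utf8_bom and read_text_without_bom are hypothetical; the code relies only on the standard codecs.BOM_UTF8 constant and the utf-8-sig codec.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM; other input is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_text_without_bom(path: str) -> str:
    """Read a UTF-8 file, discarding a leading BOM if one is present."""
    # 'utf-8-sig' behaves like 'utf-8' when the file has no BOM.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()
```

The byte-level helper is useful when the data must stay as bytes (for example before writing a shell script back to disk), while the utf-8-sig codec is the simpler choice when the goal is to work with decoded text.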

Use Case

Use this when debugging file parsing errors caused by invisible BOM characters, investigating why CSV or JSON files fail to parse despite appearing correct, or verifying the encoding of files received from external systems.