Byte Order Mark (BOM) and Encoding Markers in Unicode
Understand the Byte Order Mark (U+FEFF), its role in UTF-8/UTF-16/UTF-32 detection, its 3-byte UTF-8 representation, and how to identify and handle it in files.
Detailed Explanation
Byte Order Mark (BOM)
The Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the beginning of a text file to indicate its encoding and byte order. While useful for encoding detection, it frequently causes issues in data processing.
BOM in Different Encodings
| Encoding | BOM Bytes | Byte Order |
|---|---|---|
| UTF-8 | EF BB BF | N/A (UTF-8 has no byte order issue) |
| UTF-16 BE | FE FF | Big-endian |
| UTF-16 LE | FF FE | Little-endian |
| UTF-32 BE | 00 00 FE FF | Big-endian |
| UTF-32 LE | FF FE 00 00 | Little-endian |
UTF-8 BOM
The UTF-8 BOM (EF BB BF) is technically unnecessary because UTF-8 does not have byte order ambiguity. However, some applications (notably Windows Notepad and Excel) add it to UTF-8 files. This 3-byte sequence at the start of a file can cause problems:
- Shell scripts: A BOM before
#!/bin/bashbreaks the shebang - JSON/YAML parsing: Some parsers reject files starting with BOM
- CSV imports: The BOM may appear as a garbage character in the first field
- HTTP responses: BOM before HTML can trigger quirks mode
- Concatenation: Concatenating BOM-prefixed files puts BOM in the middle
Detection with the Unicode Inspector
Paste suspicious text into the Unicode Inspector. If a BOM is present, you will see U+FEFF as the first character with the name "BYTE ORDER MARK" and category "Other". Its 3-byte UTF-8 encoding (EF BB BF) is clearly displayed.
Historical Note
U+FEFF originally served a dual purpose: as a BOM at file start and as ZERO WIDTH NO-BREAK SPACE (ZWNBSP) within text. Unicode 3.2 deprecated the ZWNBSP use and introduced U+2060 (WORD JOINER) as its replacement. When U+FEFF appears anywhere other than the beginning of a file, it should be treated as a legacy ZWNBSP.
Removing the BOM
In most programming languages, stripping the BOM is straightforward:
- JavaScript:
text.replace(/^\uFEFF/, '') - Python: Open with
encoding='utf-8-sig' - Command line:
sed '1s/^\xEF\xBB\xBF//' file.txt
Use Case
Use this when debugging file parsing errors caused by invisible BOM characters, investigating why CSV or JSON files fail to parse despite appearing correct, or verifying the encoding of files received from external systems.