UTF-8 BOM — Byte Order Mark (EF BB BF)
Understand the UTF-8 Byte Order Mark (BOM) bytes EF BB BF and when they appear in files. Learn why BOM causes issues in PHP, shell scripts, and JSON parsing.
Hex
EF BB BF
ASCII
(invisible BOM)
Detailed Explanation
The UTF-8 Byte Order Mark (BOM) is a three-byte sequence — EF BB BF — that can appear at the very beginning of a text file to signal that the file is encoded in UTF-8. While well-intentioned, the UTF-8 BOM frequently causes bugs and is generally considered unnecessary by modern standards.
What is the BOM?
The BOM is Unicode character U+FEFF (Zero Width No-Break Space). In UTF-8, this character encodes to three bytes: EF BB BF. It was originally designed for UTF-16 encoding, where it serves the essential purpose of indicating byte order — whether the file uses big-endian (FE FF) or little-endian (FF FE) byte ordering. In UTF-8, there is no byte-order ambiguity (bytes are always in the same order), so the BOM serves only as an encoding identifier.
BOMs across Unicode encodings:
| Encoding | BOM (Hex) | Size |
|---|---|---|
| UTF-8 | EF BB BF |
3 bytes |
| UTF-16 Big Endian | FE FF |
2 bytes |
| UTF-16 Little Endian | FF FE |
2 bytes |
| UTF-32 Big Endian | 00 00 FE FF |
4 bytes |
| UTF-32 Little Endian | FF FE 00 00 |
4 bytes |
Problems caused by the UTF-8 BOM:
- PHP — A BOM before the
<?phptag causes "headers already sent" errors because the three BOM bytes are output before any HTTP headers can be set - Shell scripts — A BOM before
#!/bin/bashmakes the shebang line unrecognizable, and the script fails to execute - JSON — The JSON specification (RFC 8259) states that JSON text must not begin with a BOM, though many parsers tolerate it
- CSV imports — Some spreadsheet programs mishandle BOM-prefixed CSV files, either displaying the BOM as garbled characters or misinterpreting the first field
- Concatenation — When combining multiple BOM-prefixed files, the BOM appears in the middle of the output, potentially causing parsing errors
When the BOM is useful:
On Windows, many programs (especially Notepad and older Microsoft tools) rely on the BOM to auto-detect UTF-8 encoding. Without it, they may misinterpret the file as the legacy Windows-1252 encoding. If you are creating files specifically for consumption by Windows tools, including the BOM may prevent encoding detection failures.
Detecting and removing the BOM:
In a hex editor, the BOM is immediately visible at offset 0: EF BB BF. If the file starts with these three bytes and you are experiencing any of the above issues, simply delete them. On the command line, you can detect and strip the BOM using sed, awk, or specialized tools. Many modern editors offer an explicit option to save "UTF-8 without BOM."
The Unicode Consortium's recommendation:
The Unicode standard neither requires nor recommends the use of BOM for UTF-8. It states that the BOM is permitted but not necessary, and its use at the beginning of a data stream may be treated as zero-width no-break space by applications that do not interpret it as a signature.
Use Case
Developers encounter BOM issues when debugging PHP 'headers already sent' errors, resolving shell script execution failures, or troubleshooting JSON parsing problems in files created by Windows applications.