Byte Order Mark (BOM) and Its Effect on String Length
Learn how the Byte Order Mark (U+FEFF) affects string length, why it appears at the start of files, and how to detect and handle it in your applications.
Detailed Explanation
The Byte Order Mark (BOM)
The Byte Order Mark (U+FEFF) is a special Unicode character that can appear at the beginning of a text file to indicate the encoding and byte order. It is invisible in most text editors but counts toward string length.
BOM Representation in Different Encodings
| Encoding | BOM Size | BOM Bytes (Hex) |
|---|---|---|
| UTF-8 | 3 bytes | EF BB BF |
| UTF-16 BE | 2 bytes | FE FF |
| UTF-16 LE | 2 bytes | FF FE |
| UTF-32 BE | 4 bytes | 00 00 FE FF |
| UTF-32 LE | 4 bytes | FF FE 00 00 |
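The byte signatures in the table can be checked directly against the start of a buffer. The following is a minimal sketch; the function name `detectBOM` and its return shape are assumptions made for illustration, not part of any standard API.

```javascript
// Hypothetical helper: identify an encoding from the leading bytes of a buffer.
// The signatures come from the table above.
function detectBOM(bytes) {
  const signatures = [
    // UTF-32 LE is listed before UTF-16 LE because both start with FF FE.
    { encoding: "UTF-32 BE", bom: [0x00, 0x00, 0xfe, 0xff] },
    { encoding: "UTF-32 LE", bom: [0xff, 0xfe, 0x00, 0x00] },
    { encoding: "UTF-8", bom: [0xef, 0xbb, 0xbf] },
    { encoding: "UTF-16 BE", bom: [0xfe, 0xff] },
    { encoding: "UTF-16 LE", bom: [0xff, 0xfe] },
  ];
  for (const { encoding, bom } of signatures) {
    if (bytes.length >= bom.length && bom.every((b, i) => bytes[i] === b)) {
      return { encoding, bomLength: bom.length };
    }
  }
  return null; // no BOM found
}

const sample = new Uint8Array([0xef, 0xbb, 0xbf, 0x48, 0x69]); // BOM + "Hi"
console.log(detectBOM(sample)); // { encoding: 'UTF-8', bomLength: 3 }
```

Note that the bytes FF FE 00 00 are inherently ambiguous: they could also be a UTF-16 LE BOM followed by a NUL character. This sketch assumes UTF-32 LE in that case.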
Impact on String Length
When you read a file with a BOM and the BOM is not stripped:
// File content: BOM + "Hello"
const text = "\uFEFFHello";
text.length; // 6 (not 5!)
text.charCodeAt(0); // 65279 (U+FEFF)
text[0] === "\uFEFF"; // true
The BOM adds:
- 1 to `.length` and to the code point count
- 3 bytes to UTF-8 size
- 2 bytes to UTF-16 size
- 4 bytes to UTF-32 size
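These differences can be verified with a short sketch. The helper names (`utf8Size`, `utf16Size`, `utf32Size`, `codePoints`) are illustrative assumptions; the UTF-16 and UTF-32 sizes are computed arithmetically rather than by actually encoding the string.

```javascript
const plain = "Hello";
const withBOM = "\uFEFF" + plain;

// UTF-8 size via TextEncoder (available in browsers and Node.js).
const utf8Size = (s) => new TextEncoder().encode(s).length;
// .length counts UTF-16 code units, so the UTF-16 size is 2 bytes per unit.
const utf16Size = (s) => s.length * 2;
// Spreading a string iterates by code point; UTF-32 uses 4 bytes per code point.
const codePoints = (s) => [...s].length;
const utf32Size = (s) => codePoints(s) * 4;

console.log(withBOM.length - plain.length);           // 1
console.log(codePoints(withBOM) - codePoints(plain)); // 1
console.log(utf8Size(withBOM) - utf8Size(plain));     // 3
console.log(utf16Size(withBOM) - utf16Size(plain));   // 2
console.log(utf32Size(withBOM) - utf32Size(plain));   // 4
```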
Common BOM Problems
- JSON parsing failure: `JSON.parse('\uFEFF{"a":1}')` throws a SyntaxError because a BOM is not valid JSON whitespace
- CSV first column corruption: The first field in a CSV file may start with the invisible BOM
- HTTP header issues: A BOM before the PHP `<?php` tag causes "headers already sent" errors
- String comparison failure: `"\uFEFFhello" !== "hello"`
- Hash mismatches: The same visible content produces different hashes with and without a BOM
Detection and Removal
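// Example input: a decoded string that still carries a leading BOM
const str = "\uFEFFHello";
// Detect BOM
const hasBOM = str.charCodeAt(0) === 0xFEFF; // true
// Remove a leading BOM if present
const clean = str.replace(/^\uFEFF/, ""); // "Hello"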
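As a rough illustration of how detection and removal address the JSON and CSV problems listed earlier, the sketch below strips the BOM both at the string level and at decode time. `stripBOM` is a hypothetical helper name and the sample data is made up for the example. The `ignoreBOM` option belongs to the standard `TextDecoder` API, which removes a leading BOM unless that option is set to `true`.

```javascript
// Hypothetical helper: remove a single leading BOM from a decoded string.
function stripBOM(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// 1. JSON: a leading BOM makes JSON.parse throw, so strip it first.
const payload = '\uFEFF{"name":"Ada"}';
let rawParseFailed = false;
try {
  JSON.parse(payload);
} catch (err) {
  rawParseFailed = err instanceof SyntaxError;
}
console.log(rawParseFailed);                     // true
console.log(JSON.parse(stripBOM(payload)).name); // Ada

// 2. CSV: the first header silently carries the BOM.
const header = "\uFEFFid,name".split(",")[0];
console.log(header === "id");           // false
console.log(stripBOM(header) === "id"); // true

// 3. Bytes: TextDecoder drops a leading BOM unless ignoreBOM is true.
const bytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x48, 0x65, 0x6c, 0x6c, 0x6f]);
console.log(new TextDecoder("utf-8").decode(bytes).length);                      // 5
console.log(new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes).length); // 6
```

Stripping at decode time is usually the simpler choice when you control how bytes become strings; the string-level helper is for text that arrives already decoded.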
When BOM Is Useful
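- UTF-16 files: The BOM indicates byte order (big-endian vs little-endian), which a decoder needs when the byte order is not declared by other means
- Windows Notepad: Older versions save UTF-8 files with a BOM by default (a common source of problems); newer Windows 10 and 11 versions default to UTF-8 without a BOM
- Excel CSV: Excel relies on a UTF-8 BOM to recognize a CSV file as UTF-8 when opening it directly; without one, non-ASCII characters may be misread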
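For the Excel case, an export routine can prepend the BOM deliberately. This is a minimal sketch: `toExcelCSV` is a hypothetical name, and the naive `join(",")` assumes no field contains commas, quotes, or line breaks.

```javascript
// Hypothetical export helper: prepend U+FEFF so the encoded file starts with EF BB BF.
// Assumes fields need no quoting or escaping.
function toExcelCSV(rows) {
  const body = rows.map((row) => row.join(",")).join("\r\n");
  return "\uFEFF" + body;
}

const csvText = toExcelCSV([
  ["name", "city"],
  ["Zoë", "München"],
]);

// Encoding to UTF-8 turns the leading U+FEFF into the 3-byte BOM.
const encoded = new TextEncoder().encode(csvText);
console.log(Array.from(encoded.slice(0, 3), (b) => b.toString(16))); // [ 'ef', 'bb', 'bf' ]
```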
Best Practice
For UTF-8 files on the web, do not use a BOM. The Unicode Standard neither requires nor recommends a BOM for UTF-8, because UTF-8 has no byte-order ambiguity. If you receive files with a BOM, strip it as early as possible during processing, before parsing, comparing, or hashing the content.
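Stripping on ingest also keeps hashes stable across sources. The sketch below assumes a Node.js environment, since it uses the `node:crypto` module; `sha256` and `normalize` are illustrative helper names, and the sample strings are made up.

```javascript
const { createHash } = require("node:crypto"); // Node.js-specific

// Hash the UTF-8 encoding of a string.
const sha256 = (text) => createHash("sha256").update(text, "utf8").digest("hex");
// Normalize on ingest: drop a single leading BOM if present.
const normalize = (text) => text.replace(/^\uFEFF/, "");

const withBOM = "\uFEFFid,name\n1,Ada"; // e.g. saved by a tool that writes a BOM
const withoutBOM = "id,name\n1,Ada";

console.log(sha256(withBOM) === sha256(withoutBOM));                       // false
console.log(sha256(normalize(withBOM)) === sha256(normalize(withoutBOM))); // true
```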
Use Case
When processing text files from different sources (Windows Notepad, Excel exports, legacy systems), detecting and handling the Byte Order Mark prevents parsing errors, hash mismatches, and invisible character issues in data pipelines.