Byte Order Mark (BOM) and Its Effect on String Length

Q: When is this useful?

When processing text files from different sources (Windows Notepad, Excel exports, legacy systems), detecting and handling the Byte Order Mark prevents parsing errors, hash mismatches, and invisible character issues in data pipelines.

Learn how the Byte Order Mark (U+FEFF) affects string length, why it appears at the start of files, and how to detect and handle it in your applications.

Detailed Explanation

The Byte Order Mark (BOM)

The Byte Order Mark (U+FEFF) is a special Unicode character that can appear at the beginning of a text file to indicate the encoding and byte order. It is invisible in most text editors but counts toward string length.

BOM Representation in Different Encodings

Encoding	BOM Size	BOM Bytes (Hex)
UTF-8	3 bytes	EF BB BF
UTF-16 BE	2 bytes	FE FF
UTF-16 LE	2 bytes	FF FE
UTF-32 BE	4 bytes	00 00 FE FF
UTF-32 LE	4 bytes	FF FE 00 00
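
The table above can be turned into a small detection routine. This is a minimal sketch rather than a complete encoding detector: `detectBom` and `BOM_SIGNATURES` are hypothetical names, and the input is assumed to be a `Uint8Array` holding the first bytes of a file. The UTF-32 signatures are checked first because the UTF-32 LE BOM (FF FE 00 00) starts with the same two bytes as the UTF-16 LE BOM (FF FE).

```javascript
// Hypothetical helper: returns the encoding implied by a leading BOM, or null.
// Longer signatures come first so UTF-32 LE is not mistaken for UTF-16 LE.
const BOM_SIGNATURES = [
  { encoding: "UTF-32 BE", bytes: [0x00, 0x00, 0xfe, 0xff] },
  { encoding: "UTF-32 LE", bytes: [0xff, 0xfe, 0x00, 0x00] },
  { encoding: "UTF-8", bytes: [0xef, 0xbb, 0xbf] },
  { encoding: "UTF-16 BE", bytes: [0xfe, 0xff] },
  { encoding: "UTF-16 LE", bytes: [0xff, 0xfe] },
];

function detectBom(bytes) {
  for (const { encoding, bytes: signature } of BOM_SIGNATURES) {
    const matches =
      bytes.length >= signature.length &&
      signature.every((value, index) => bytes[index] === value);
    if (matches) {
      return { encoding, bomLength: signature.length };
    }
  }
  return null; // no BOM found; the encoding must be determined another way
}
```

One caveat of this approach: a UTF-16 LE file whose first character after the BOM is U+0000 would be reported as UTF-32 LE. That case is rare in practice, but a BOM check alone cannot rule it out.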

Impact on String Length

When you read a file with a BOM and the BOM is not stripped:

// File content: BOM + "Hello"
const text = "\uFEFFHello";
text.length;           // 6 (not 5!)
text.charCodeAt(0);    // 65279 (U+FEFF)
text[0] === "\uFEFF";  // true

The BOM adds:

1 to .length and code point count
3 bytes to UTF-8 size
2 bytes to UTF-16 size
4 bytes to UTF-32 size
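
These numbers can be checked directly. The sketch below uses a hypothetical `measure` helper; it relies on the standard `TextEncoder` (which always produces UTF-8) and computes the UTF-16 and UTF-32 sizes arithmetically from code unit and code point counts.

```javascript
// Hypothetical helper: sizes of a string in several encodings.
function measure(text) {
  const codePoints = [...text].length; // string iteration yields code points
  return {
    length: text.length, // UTF-16 code units
    codePoints,
    utf8Bytes: new TextEncoder().encode(text).length,
    utf16Bytes: text.length * 2, // 2 bytes per code unit
    utf32Bytes: codePoints * 4, // 4 bytes per code point
  };
}

const withBom = measure("\uFEFFHello");
const withoutBom = measure("Hello");

console.log(withBom.length - withoutBom.length);         // 1
console.log(withBom.codePoints - withoutBom.codePoints); // 1
console.log(withBom.utf8Bytes - withoutBom.utf8Bytes);   // 3
console.log(withBom.utf16Bytes - withoutBom.utf16Bytes); // 2
console.log(withBom.utf32Bytes - withoutBom.utf32Bytes); // 4
```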

Common BOM Problems

JSON parsing failure: JSON.parse('\uFEFF{"a":1}') throws a SyntaxError because U+FEFF is not valid JSON whitespace, even though the rest of the input is valid JSON
CSV first column corruption: The first field in a CSV file may start with the invisible BOM
HTTP header issues: A BOM before the opening <?php tag is sent to the client as output, which causes "headers already sent" errors
String comparison failure: "\uFEFFhello" !== "hello"
Hash mismatches: Same visible content produces different hashes with and without BOM
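
The JSON and string comparison problems are easy to reproduce. The following is a small sketch, with `parseJsonAllowingBom` as a hypothetical helper name, showing the failure and one possible workaround (dropping a single leading U+FEFF before parsing).

```javascript
// Hypothetical helper: strip one leading BOM, then parse as JSON.
function parseJsonAllowingBom(text) {
  const withoutBom = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  return JSON.parse(withoutBom);
}

const raw = '\uFEFF{"name":"hello"}';

// Parsing the raw text fails because of the leading BOM.
let failedWithBom = false;
try {
  JSON.parse(raw);
} catch (error) {
  failedWithBom = error instanceof SyntaxError;
}

console.log(failedWithBom);                  // true
console.log(parseJsonAllowingBom(raw).name); // hello
console.log("\uFEFFhello" === "hello");      // false
```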

Detection and Removal

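// Example input (assumed): text read from a file that starts with a BOM
const str = "\uFEFFHello";

// Detect BOM: check whether the first code unit is U+FEFF
const hasBOM = str.charCodeAt(0) === 0xFEFF;  // true

// Remove BOM: the ^ anchor limits the match to the very start of the string
const clean = str.replace(/^\uFEFF/, "");     // "Hello"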
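
When the input is still raw bytes rather than a string, the standard `TextDecoder` can handle the BOM during decoding. By default it consumes a leading BOM; passing `ignoreBOM: true` keeps U+FEFF in the output. A brief sketch, with the byte values written out by hand for illustration:

```javascript
// UTF-8 bytes for BOM + "Hello"
const bytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x48, 0x65, 0x6c, 0x6c, 0x6f]);

// Default behavior: the decoder drops a leading BOM.
const stripped = new TextDecoder("utf-8").decode(bytes);

// ignoreBOM: true leaves U+FEFF in the decoded string.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

console.log(stripped.length);               // 5
console.log(kept.length);                   // 6
console.log(kept.charCodeAt(0) === 0xfeff); // true
```

Decoding this way avoids a separate stripping step, but it only helps where the code controls the decoding; strings handed over by other APIs may still carry the BOM.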

When BOM Is Useful

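UTF-16 files: BOM indicates byte order (big-endian vs little-endian), which is essential when the byte order is not declared elsewhere
Windows Notepad: Older versions add a BOM when saving as UTF-8 (a common source of problems); since Windows 10 version 1903, Notepad defaults to UTF-8 without a BOM
Excel CSV: When a CSV file is opened directly, Excel relies on the UTF-8 BOM to recognize the encoding; without it, non-ASCII characters are often misread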
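
For the Excel case, a BOM can be added deliberately when generating a CSV export. The sketch below is illustrative only: `toExcelCsv` is a hypothetical helper, the sample rows are made up, and the quoting rules are a simplified version of common CSV conventions.

```javascript
// Hypothetical helper: build CSV text with a leading BOM for Excel.
function toExcelCsv(rows) {
  const escapeField = (value) => {
    const text = String(value);
    // Quote fields containing a comma, quote, or line break; double embedded quotes.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const body = rows.map((row) => row.map(escapeField).join(",")).join("\r\n");
  return "\uFEFF" + body;
}

const csv = toExcelCsv([
  ["name", "city"],
  ["Zoë", "München"],
]);

// When encoded as UTF-8, the output starts with the bytes EF BB BF.
const encoded = new TextEncoder().encode(csv);
console.log(encoded[0] === 0xef && encoded[1] === 0xbb && encoded[2] === 0xbf); // true
```

Any consumer other than Excel that reads this file should expect the BOM and strip it before parsing the first field.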

Best Practice

For UTF-8 files on the web, do not use a BOM. The Unicode Standard neither requires nor recommends a BOM for UTF-8, because UTF-8 has no byte-order ambiguity; there the BOM serves only as an encoding signature. If you receive files with a BOM, strip it during processing.

Use Case