CSV Character Encoding: UTF-8, Shift-JIS, and More
Handle different character encodings when parsing CSV files. Covers UTF-8, UTF-8 BOM, Shift-JIS, Latin-1, and encoding detection strategies.
Detailed Explanation
Character Encoding in CSV Files
CSV files have no built-in encoding declaration. The file is just raw bytes, and the parser must know or guess the correct encoding to interpret those bytes as text.
Common CSV encodings
| Encoding | Used by | Notes |
|---|---|---|
| UTF-8 | Modern tools, APIs, web | Universal, supports all Unicode characters |
| UTF-8 with BOM | Excel (Windows) | Starts with bytes EF BB BF |
| Shift-JIS | Japanese legacy systems | Windows code page 932 is Microsoft's extended variant |
| ISO-8859-1 (Latin-1) | European legacy systems | Single-byte, 0-255 |
| Windows-1252 | Older Windows applications | Latin-1 plus printable characters (such as € and curly quotes) in 0x80-0x9F |
The BOM problem
Microsoft Excel on Windows writes a UTF-8 BOM (Byte Order Mark) when saving with its "CSV UTF-8" format -- three invisible bytes at the start of the file (EF BB BF). Once decoded, those bytes become the single character U+FEFF. If your parser doesn't strip it, the first header name will be corrupted:
Expected: "name"
Actual: "\uFEFFname"
Always check for and strip the BOM before parsing:
function stripBom(text) {
  // U+FEFF is what the BOM bytes EF BB BF decode to.
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}
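When you are working with raw bytes rather than an already-decoded string, the same check can be done before decoding. The sketch below is only an illustration: `stripUtf8Bom` is a hypothetical helper name, and it handles the UTF-8 BOM only.

```javascript
// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) from a byte buffer.
function stripUtf8Bom(bytes) {
  const hasBom =
    bytes.length >= 3 &&
    bytes[0] === 0xef &&
    bytes[1] === 0xbb &&
    bytes[2] === 0xbf;
  // subarray() returns a view on the same memory, so nothing is copied.
  return hasBom ? bytes.subarray(3) : bytes;
}

// BOM followed by the ASCII bytes for "name".
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x6e, 0x61, 0x6d, 0x65]);
const header = new TextDecoder("utf-8").decode(stripUtf8Bom(withBom));
```

Note that `TextDecoder` itself drops a leading BOM unless it is constructed with `ignoreBOM: true`, so the string-level `stripBom` matters mostly for text that reached you through some other decoding path.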
Shift-JIS and Japanese CSV files
CSV files from Japanese systems (government data, banking exports, legacy ERP) often use Shift-JIS encoding. Modern browsers support decoding via the TextDecoder API:
// arrayBuffer holds the file's raw bytes, e.g. from `await file.arrayBuffer()`.
const decoder = new TextDecoder("shift-jis");
const text = decoder.decode(arrayBuffer);
If you decode a Shift-JIS file as UTF-8, Japanese characters become garbled (文字化け / mojibake). Always verify the encoding before parsing.
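To make the failure mode concrete, here is a small sketch that decodes the same bytes both ways. The byte values are the Shift-JIS encoding of "日本語". It assumes a runtime whose `TextDecoder` supports Shift-JIS (browsers, or Node.js built with full ICU data).

```javascript
// "日本語" encoded as Shift-JIS: 93 FA, 96 7B, 8C EA.
const sjisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);

// Wrong guess: the bytes are not valid UTF-8, so the decoder substitutes
// U+FFFD replacement characters and the text is garbled.
const asUtf8 = new TextDecoder("utf-8").decode(sjisBytes);
const garbled = asUtf8.includes("\uFFFD");

// Right guess: the same bytes decode to the intended text.
const asSjis = new TextDecoder("shift_jis").decode(sjisBytes);
```

Checking the decoded text for `\uFFFD` is a cheap way to notice a wrong UTF-8 guess before handing the text to the CSV parser.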
Encoding detection in the browser
The Web platform provides TextDecoder for known encodings, but automatic detection requires heuristics:
- Check for BOM. If present, the encoding is known.
- Try UTF-8 first. Most modern files are UTF-8. If decoding succeeds without replacement characters (\uFFFD), it is likely correct.
- Fall back to locale-specific encoding. Based on the file's origin, try Shift-JIS, Latin-1, etc.
- Let the user choose. When auto-detection fails, provide an encoding selector in the UI.
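The steps above could be sketched roughly as follows. This is one possible arrangement, not a definitive detector: `decodeCsvBytes` and its `fallbacks` parameter are hypothetical names, only the UTF-8 BOM is checked, and the fallback order is an assumption you would tune to where your files come from. It relies on the `fatal: true` option of `TextDecoder`, which throws on invalid input instead of inserting `\uFFFD`.

```javascript
// Hypothetical helper: decode CSV bytes using the heuristics listed above.
// `fallbacks` is an assumed, caller-supplied list of encoding labels.
function decodeCsvBytes(bytes, fallbacks = ["shift_jis"]) {
  // Step 1: a UTF-8 BOM settles the question.
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    const text = new TextDecoder("utf-8").decode(bytes.subarray(3));
    return { encoding: "utf-8", text };
  }
  // Steps 2 and 3: strict UTF-8 first, then each fallback in order.
  for (const encoding of ["utf-8", ...fallbacks]) {
    try {
      const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
      return { encoding, text };
    } catch (err) {
      // Bytes are invalid for this encoding (or the label is unsupported);
      // move on to the next candidate.
    }
  }
  // Step 4: nothing fit, so the caller should ask the user.
  return { encoding: null, text: null };
}

const detected = decodeCsvBytes(new Uint8Array([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]));
```

Single-byte encodings such as Windows-1252 accept every byte sequence, so they never fail in strict mode. If you add one to `fallbacks`, put it last, and keep the manual encoding selector for cases where the guess is wrong.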
JSON output encoding
Regardless of the input CSV encoding, the JSON output should always be UTF-8. RFC 8259 requires UTF-8 for JSON exchanged between systems that are not part of a closed ecosystem. Non-ASCII characters can either be written directly as UTF-8 bytes or escaped as \uXXXX sequences; both forms parse to the same string.
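As a rough illustration of the two output forms, the sketch below serializes a row once as UTF-8 bytes and once as ASCII-only JSON. The `rows` data is made up for the example, and the escaping step is optional.

```javascript
// Example data, as it might look after decoding a Shift-JIS CSV.
const rows = [{ name: "日本語" }];

// JSON.stringify returns a JavaScript string; TextEncoder always encodes to UTF-8.
const json = JSON.stringify(rows);
const utf8Bytes = new TextEncoder().encode(json);

// Optional: escape every non-ASCII UTF-16 code unit as \uXXXX so the
// output is pure ASCII. Surrogate pairs become two escapes, which is valid JSON.
const asciiJson = json.replace(/[\u0080-\uffff]/g, (ch) =>
  "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
);
```

Both `json` and `asciiJson` parse back to the same data. The escaped form is larger but survives transports that are not 8-bit clean.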
Use Case
Processing government open data CSV files from Japan that are published in Shift-JIS encoding and must be converted to UTF-8 JSON for a multilingual web application.