CSV Character Encoding: UTF-8, Shift-JIS, and More

Handle different character encodings when parsing CSV files. Covers UTF-8, UTF-8 BOM, Shift-JIS, Latin-1, and encoding detection strategies.


Detailed Explanation

Character Encoding in CSV Files

CSV files have no built-in encoding declaration. The file is just raw bytes, and the parser must know or guess the correct encoding to interpret those bytes as text.

Common CSV encodings

Encoding             | Used by                     | Notes
UTF-8                | Modern tools, APIs, web     | Universal, supports all Unicode characters
UTF-8 with BOM       | Excel (Windows)             | Starts with bytes EF BB BF
Shift-JIS            | Japanese legacy systems     | Windows code page 932 is Microsoft's extended variant
ISO-8859-1 (Latin-1) | European legacy systems     | Single-byte, byte values 0-255
Windows-1252         | Older Windows applications  | Similar to Latin-1, with extra printable characters in 0x80-0x9F
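
To make the table concrete, here is a minimal sketch that decodes the same five bytes under two of the encodings listed. The byte values are an illustrative sample chosen for this example, and the snippet assumes a runtime whose TextDecoder supports the "windows-1252" label (browsers do; Node.js needs a build with full ICU data).

```javascript
// The UTF-8 encoding of "café": é is the two-byte sequence C3 A9.
const bytes = new Uint8Array([0x63, 0x61, 0x66, 0xC3, 0xA9]);

// Same bytes, two interpretations.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);
const asCp1252 = new TextDecoder("windows-1252").decode(bytes);

console.log(asUtf8);   // café
console.log(asCp1252); // cafÃ©
```

Nothing in the bytes themselves says which reading is right, which is why the parser has to be told or has to guess.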

The BOM problem

When Microsoft Excel on Windows saves a file in its "CSV UTF-8" format, it writes a UTF-8 BOM (Byte Order Mark): three invisible bytes at the start of the file (\xEF\xBB\xBF). If the BOM is not stripped, it decodes to the character U+FEFF and becomes part of the first header name:

Expected: "name"
Actual:   "\uFEFFname"

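Check for and strip the BOM before parsing. TextDecoder removes a leading BOM by default (its ignoreBOM option is false), so the check matters mainly when the text was decoded some other way:

// Remove a leading U+FEFF (the decoded BOM) from already-decoded text.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}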
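
The same check can be done on the raw bytes before decoding, which also tells you the encoding. The following is a sketch only: `sniffBom` is a hypothetical helper name, and it covers just the UTF-8 and UTF-16 BOMs.

```javascript
// Hypothetical helper: report the encoding implied by a BOM, or null if none.
function sniffBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return { encoding: "utf-8", length: 3 };
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return { encoding: "utf-16le", length: 2 };
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return { encoding: "utf-16be", length: 2 };
  }
  return null;
}

// Sample input: UTF-8 BOM followed by the ASCII bytes of "name".
const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x6E, 0x61, 0x6D, 0x65]);
const bom = sniffBom(withBom);

// Skip the BOM bytes, then decode the rest with the detected encoding.
const header = new TextDecoder(bom.encoding).decode(withBom.subarray(bom.length));
console.log(JSON.stringify(header)); // "name"
```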

Shift-JIS and Japanese CSV files

CSV files from Japanese systems (government data, banking exports, legacy ERP) often use Shift-JIS encoding. Modern browsers support decoding via the TextDecoder API:

// arrayBuffer holds the raw file bytes, e.g. from await file.arrayBuffer()
const decoder = new TextDecoder("shift-jis");
const text = decoder.decode(arrayBuffer);

If you decode a Shift-JIS file as UTF-8, Japanese characters become garbled (文字化け / mojibake). Always verify the encoding before parsing.
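
As a small illustration of the garbling, the sketch below decodes the Shift-JIS bytes for 日本 twice, once correctly and once as UTF-8. The four-byte sample is chosen for this example, and the snippet assumes a runtime whose TextDecoder supports the "shift-jis" label.

```javascript
// The Shift-JIS encoding of "日本": 93 FA (日) and 96 7B (本).
const sjisBytes = new Uint8Array([0x93, 0xFA, 0x96, 0x7B]);

const correct = new TextDecoder("shift-jis").decode(sjisBytes);
const garbled = new TextDecoder("utf-8").decode(sjisBytes);

console.log(correct);                    // 日本
console.log(garbled.includes("\uFFFD")); // true: invalid UTF-8 became replacement characters
```

In this sample the wrong decoding is easy to spot because the bytes are not valid UTF-8. Between two legacy encodings the wrong choice often decodes without any error, so a successful decode is not proof of the right encoding.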

Encoding detection in the browser

The Web platform provides TextDecoder for known encodings, but automatic detection requires heuristics:

  1. Check for BOM. If present, the encoding is known.
  2. Try UTF-8 first. Most modern files are UTF-8. If decoding produces no replacement characters (\uFFFD), or does not throw when the decoder is created with { fatal: true }, the file is very likely UTF-8.
  3. Fall back to locale-specific encoding. Based on the file's origin, try Shift-JIS, Latin-1, etc.
  4. Let the user choose. When auto-detection fails, provide an encoding selector in the UI.
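
The four steps above can be sketched roughly as follows. `decodeCsvBytes` and its default fallback list are hypothetical, chosen for illustration. The sketch relies on the TextDecoder `fatal` option, which makes decoding throw on invalid input instead of inserting \uFFFD.

```javascript
// Hypothetical helper: returns { encoding, text }, or null if nothing fits.
// The default fallback list is an assumption; pick it from the file's origin.
function decodeCsvBytes(bytes, fallbacks = ["shift-jis", "windows-1252"]) {
  // Step 1: a UTF-8 BOM settles the question (TextDecoder strips it by default).
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return { encoding: "utf-8", text: new TextDecoder("utf-8").decode(bytes) };
  }
  // Steps 2 and 3: try strict UTF-8, then each fallback in order.
  for (const encoding of ["utf-8", ...fallbacks]) {
    try {
      const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
      return { encoding, text };
    } catch (err) {
      // Invalid byte sequence for this encoding; try the next candidate.
    }
  }
  // Step 4: nothing decoded cleanly, so the caller should ask the user.
  return null;
}

const guess = decodeCsvBytes(new Uint8Array([0x93, 0xFA, 0x96, 0x7B]));
console.log(guess.encoding); // shift-jis
```

Order matters in the fallback list. Single-byte encodings such as windows-1252 accept every byte value, so they never throw and should come last. Pure ASCII input is reported as UTF-8, which is harmless because ASCII decodes identically in all the encodings discussed here.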

JSON output encoding

Regardless of the input CSV encoding, the JSON output should always be UTF-8. RFC 8259 requires UTF-8 for JSON text exchanged between systems that are not part of a closed ecosystem, so non-ASCII characters are either written directly as UTF-8 bytes or escaped as \uXXXX sequences.
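
Both output options can be sketched in a few lines. The sample row and the `toAsciiJson` helper are hypothetical. TextEncoder always produces UTF-8, and the regular expression escapes each UTF-16 code unit above 0x7F.

```javascript
// Sample parsed rows (illustrative data).
const rows = [{ name: "日本", city: "Zürich" }];

// Option 1: emit non-ASCII characters directly, encoded as UTF-8 bytes.
const json = JSON.stringify(rows);
const utf8Bytes = new TextEncoder().encode(json);

// Option 2 (hypothetical helper): ASCII-only JSON using \uXXXX escapes.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

console.log(toAsciiJson({ name: "日本" })); // {"name":"\u65e5\u672c"}
```

Both forms parse back to the same values. The escaped form is larger but survives channels that are not 8-bit clean.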

Use Case

Processing government open data CSV files from Japan that are published in Shift-JIS encoding and must be converted to UTF-8 JSON for a multilingual web application.
