Encoding Detector

Detect character encoding of text and files by analyzing byte patterns. Supports UTF-8, ASCII, ISO-8859-1, Shift_JIS, and more.

About This Tool

The Encoding Detector analyzes text and file byte patterns to identify the character encoding being used. It supports detection of a wide range of encodings including UTF-8, ASCII, ISO-8859-1 (Latin-1), Windows-1252 (CP-1252), Shift_JIS, EUC-JP, and GB2312, along with BOM-based detection for UTF-16 and UTF-32 variants.

Character encoding determines how bytes are mapped to characters. When a file is opened with the wrong encoding, you see garbled text known as mojibake. This is a common problem when transferring files between different operating systems, reading legacy databases, or processing text from international sources. Identifying the correct encoding is the first step to fixing these issues.

The detector works by examining byte-level patterns. It first checks for a Byte Order Mark (BOM), which is a special sequence of bytes at the beginning of a file that unambiguously identifies the encoding. If no BOM is found, the tool uses heuristic analysis: it validates whether the byte sequences conform to UTF-8 multi-byte rules, checks for byte ranges characteristic of ISO-8859-1 or Windows-1252, and tests for valid lead-byte / trail-byte pairs used in Japanese (Shift_JIS, EUC-JP) and Chinese (GB2312) encodings.
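The UTF-8 part of that heuristic can be sketched as a simple byte walk: every multi-byte character must start with a valid lead byte and be followed by the right number of continuation bytes (`10xxxxxx`). This is an illustrative sketch, not the tool's actual code; the function name `looksLikeUtf8` is invented for the example.

```javascript
// Minimal sketch of the UTF-8 heuristic: walk the bytes and verify that
// every multi-byte sequence follows the lead-byte / continuation-byte rules.
function looksLikeUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let extra;
    if (b < 0x80) extra = 0;                 // ASCII, single byte
    else if ((b & 0xe0) === 0xc0) extra = 1; // 110xxxxx: 2-byte sequence
    else if ((b & 0xf0) === 0xe0) extra = 2; // 1110xxxx: 3-byte sequence
    else if ((b & 0xf8) === 0xf0) extra = 3; // 11110xxx: 4-byte sequence
    else return false;                       // stray continuation or invalid lead byte
    for (let k = 1; k <= extra; k++) {
      // each continuation byte must match 10xxxxxx
      if (i + k >= bytes.length || (bytes[i + k] & 0xc0) !== 0x80) return false;
    }
    i += extra + 1;
  }
  return true;
}
```

For example, "é" in UTF-8 is the pair C3 A9 and passes this check, while the single Latin-1 byte E9 followed by a space fails it — which is exactly the kind of evidence the detector weighs when UTF-8 and ISO-8859-1 are both candidates.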

Each candidate encoding is assigned a confidence percentage based on how well the byte data matches the expected patterns. Results are sorted by confidence, making it easy to identify the most likely encoding. The tool also provides a hex dump of the first bytes for manual inspection.
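The ranking step amounts to sorting candidates by their confidence score, descending. The object shapes below are illustrative only, not the tool's actual data model:

```javascript
// Sketch of ranking candidate encodings by confidence (highest first).
// Encodings and percentages here are made-up example data.
const candidates = [
  { encoding: "ISO-8859-1",   confidence: 60 },
  { encoding: "UTF-8",        confidence: 95 },
  { encoding: "Windows-1252", confidence: 65 },
];
const ranked = [...candidates].sort((a, b) => b.confidence - a.confidence);
// ranked[0] is now the most likely encoding
```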

All processing happens entirely in your browser. Your text and files are never uploaded to any server. The file is read as an ArrayBuffer using the File API and analyzed byte-by-byte in JavaScript.

How to Use

  1. Choose Text mode to paste text directly, or File mode to analyze a file.
  2. In Text mode, paste your content into the text area. The encoding analysis runs automatically as you type or paste.
  3. In File mode, drag and drop a file onto the drop zone, or click "browse" to select a file from your computer.
  4. Review the results table showing each detected encoding with its confidence score and description.
  5. Scroll down to the hex dump to inspect the raw values of the first bytes of your data.
  6. Click Copy (or press Ctrl+Shift+C) to copy the detection results to your clipboard.

FAQ

Is my data safe?

Yes. All encoding detection runs entirely in your browser using JavaScript. No data is sent to any server. Your text and files stay on your machine.

Why does pasted text always show as UTF-8?

When you paste text into a browser text area, the browser converts it to its internal string representation (UTF-16). When the tool encodes this string to bytes for analysis, it uses the TextEncoder API which always produces UTF-8. To detect the original encoding of a file, use the File mode instead.
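You can verify this in a browser console or Node: `TextEncoder` supports UTF-8 and nothing else, so byte-level analysis of pasted text can only ever see UTF-8 bytes.

```javascript
// TextEncoder only ever emits UTF-8 bytes, regardless of the encoding
// the pasted text originally came from.
const encoder = new TextEncoder();
// encoder.encoding is always "utf-8"; no other encoding can be requested.
const bytes = encoder.encode("café");
// "é" comes out as the UTF-8 pair C3 A9 here, even if the source file
// stored it as the single Latin-1 byte E9.
```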

What is a Byte Order Mark (BOM)?

A BOM is a special Unicode character (U+FEFF) placed at the beginning of a file to signal its encoding and byte order. For example, a UTF-8 BOM is the three-byte sequence EF BB BF, while a UTF-16 LE BOM is FF FE. When a BOM is present, encoding detection is 100% certain.
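BOM sniffing can be sketched as a prefix match against known signatures. One subtlety: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so longer signatures must be tested first. The function name `sniffBom` is invented for this example:

```javascript
// Sketch of BOM-based detection: match known byte signatures at offset 0,
// longest signatures first so UTF-32 LE is not misread as UTF-16 LE.
function sniffBom(bytes) {
  const boms = [
    { name: "UTF-8",     sig: [0xef, 0xbb, 0xbf] },
    { name: "UTF-32 LE", sig: [0xff, 0xfe, 0x00, 0x00] },
    { name: "UTF-32 BE", sig: [0x00, 0x00, 0xfe, 0xff] },
    { name: "UTF-16 LE", sig: [0xff, 0xfe] },
    { name: "UTF-16 BE", sig: [0xfe, 0xff] },
  ];
  for (const { name, sig } of boms) {
    if (sig.length <= bytes.length && sig.every((b, i) => bytes[i] === b)) {
      return name;
    }
  }
  return null; // no BOM: fall back to heuristic analysis
}
```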

How accurate is the detection?

Detection accuracy depends on the data. Files with a BOM are detected with 100% confidence. For UTF-8 text with multi-byte characters (e.g., accented letters, CJK), accuracy is very high. Short ASCII-only strings are ambiguous since ASCII is a valid subset of many encodings. The confidence percentage reflects the strength of the heuristic match.

What is the difference between ISO-8859-1 and Windows-1252?

ISO-8859-1 (Latin-1) and Windows-1252 are both single-byte Western European encodings. They are identical for byte values 0xA0-0xFF, but differ in the 0x80-0x9F range. ISO-8859-1 maps these to control characters, while Windows-1252 maps them to printable characters like curly quotes, em dashes, and the euro sign. In practice, many files labeled as ISO-8859-1 are actually Windows-1252.
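The 0x80-0x9F difference is easy to see in code. True ISO-8859-1 maps every byte to the same code point (so 0x93 becomes the invisible control character U+0093), while Windows-1252 reassigns that range to printable characters (0x93 becomes a left curly quote). Note a related gotcha: per the WHATWG Encoding Standard, the label "iso-8859-1" is treated as an alias for windows-1252 in browsers, so `TextDecoder` cannot produce true Latin-1 for this range.

```javascript
// The same byte, 0x93, under the two encodings:
const byte = 0x93;
// True ISO-8859-1 is an identity mapping from byte to code point,
// so 0x93 decodes to the control character U+0093.
const latin1 = String.fromCharCode(byte);
// Windows-1252 maps 0x93 to a printable left double quotation mark.
const cp1252 = new TextDecoder("windows-1252").decode(Uint8Array.of(byte));
```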

Can I analyze large files?

Yes. The file is read into memory as an ArrayBuffer and analyzed in JavaScript. Files up to several hundred megabytes work well. Very large files may be limited by your browser's available memory. The hex dump shows only the first 160 bytes regardless of file size.
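A hex dump like the one the tool shows can be sketched in a few lines: print offsets, hex byte values, and a printable-ASCII column, 16 bytes per row, capped at the first 160 bytes. The function name `hexDump` and the exact layout are illustrative, not necessarily what the tool renders.

```javascript
// Sketch of a 160-byte hex dump preview: offset, hex bytes, ASCII column.
function hexDump(bytes, limit = 160) {
  const rows = [];
  const n = Math.min(bytes.length, limit);
  for (let off = 0; off < n; off += 16) {
    const slice = Array.from(bytes.subarray(off, Math.min(off + 16, n)));
    const hex = slice.map(b => b.toString(16).padStart(2, "0")).join(" ");
    // show printable ASCII (0x20-0x7E); everything else becomes "."
    const ascii = slice
      .map(b => (b >= 0x20 && b < 0x7f ? String.fromCharCode(b) : "."))
      .join("");
    rows.push(off.toString(16).padStart(8, "0") + "  " + hex + "  " + ascii);
  }
  return rows.join("\n");
}
```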

Related Tools