Calculate Byte Size of Text — UTF-8, UTF-16, ASCII
Calculate the byte size of any text in UTF-8, UTF-16, and ASCII encodings. Learn how different character encodings affect storage size and why the same text can have different byte sizes across encodings.
Detailed Explanation
Text Byte Size Calculation
The byte size of text depends on the character encoding as well as on the characters themselves. The same string can occupy different amounts of storage depending on whether it is encoded as UTF-8, UTF-16, or ASCII, and ASCII can represent only code points U+0000 to U+007F.
Calculating Byte Size in JavaScript
The TextEncoder API provides accurate UTF-8 byte counts:
function getByteSize(text) {
  const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8
  const encoded = encoder.encode(text); // Uint8Array of UTF-8 bytes
  return encoded.byteLength;
}
For multiple encodings:
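function getByteSizes(text) {
  const utf8 = new TextEncoder().encode(text).byteLength;
  // String length counts UTF-16 code units, so this is exact (BOM not included)
  const utf16 = text.length * 2;
  // ASCII covers only U+0000 to U+007F; null means the text is not ASCII-encodable
  const ascii = /^[\x00-\x7F]*$/.test(text) ? text.length : null;
  return { utf8, utf16, ascii };
}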
Encoding Comparison
| Character | UTF-8 | UTF-16 | ASCII |
|---|---|---|---|
| A (U+0041) | 1 byte | 2 bytes | 1 byte |
| é (U+00E9) | 2 bytes | 2 bytes | N/A |
| 世 (U+4E16) | 3 bytes | 2 bytes | N/A |
| Emoji (U+1F600) | 4 bytes | 4 bytes | N/A |
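The figures in the table can be checked directly. The following is a minimal sketch, assuming a runtime with a global TextEncoder (modern browsers or Node.js); the helper name `describeSizes` is hypothetical, and it expects a single-character string.

```javascript
// Reports the encoded size of a single character in each encoding.
// `describeSizes` is an illustrative helper, not part of any library.
function describeSizes(ch) {
  const utf8 = new TextEncoder().encode(ch).byteLength;
  const utf16 = ch.length * 2; // length counts 16-bit code units
  const ascii = ch.codePointAt(0) <= 0x7f ? 1 : null; // null = not representable
  return { utf8, utf16, ascii };
}

// Escapes are used so the sample does not depend on the file's own encoding.
const samples = ["A", "\u00E9", "\u4E16", "\u{1F600}"];
for (const ch of samples) {
  console.log(ch, describeSizes(ch));
}
```

The emoji reports 4 bytes in UTF-16 because it lies outside the Basic Multilingual Plane and is stored as a surrogate pair of two 16-bit code units.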
UTF-8 Variable-Width Encoding
UTF-8 uses 1-4 bytes per character:
- 1 byte: U+0000 to U+007F (ASCII compatible) — English letters, digits, basic punctuation
- 2 bytes: U+0080 to U+07FF — accented characters, Greek, Cyrillic, Arabic, Hebrew
- 3 bytes: U+0800 to U+FFFF — CJK characters, most symbols
- 4 bytes: U+10000 to U+10FFFF — emoji, historic scripts, musical notation
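The ranges above translate directly into a byte counter that does not need TextEncoder at all. This is a sketch for illustration; the function names `utf8BytesForCodePoint` and `utf8ByteLength` are hypothetical.

```javascript
// Maps a Unicode code point to its UTF-8 encoded length, following the ranges listed above.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4; // U+10000 to U+10FFFF
}

// Sums the per-character sizes. for...of iterates by code point,
// so a surrogate pair is counted once as a 4-byte character.
function utf8ByteLength(text) {
  let total = 0;
  for (const ch of text) {
    total += utf8BytesForCodePoint(ch.codePointAt(0));
  }
  return total;
}

console.log(utf8ByteLength("A\u00E9\u4E16\u{1F600}")); // 1 + 2 + 3 + 4 = 10
```

For well-formed strings this should agree with `new TextEncoder().encode(text).byteLength`, which remains the simpler choice where the API is available.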
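This variable width makes UTF-8 compact for English-dominated text, where most characters take 1 byte, but less compact than UTF-16 for CJK-heavy content, where most characters take 3 bytes in UTF-8 versus 2 bytes in UTF-16.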
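A quick comparison makes the trade-off concrete. The sketch below uses two arbitrary sample strings chosen for illustration; `compareEncodings` is a hypothetical helper name.

```javascript
// Returns the UTF-8 and UTF-16 byte sizes of a string (no BOM counted).
function compareEncodings(text) {
  return {
    utf8: new TextEncoder().encode(text).byteLength,
    utf16: text.length * 2, // each UTF-16 code unit is 2 bytes
  };
}

console.log(compareEncodings("Hello, world")); // { utf8: 12, utf16: 24 }
console.log(compareEncodings("\u4E16\u754C\u4F60\u597D")); // { utf8: 12, utf16: 8 }
```

The English sample is half the size in UTF-8, while the four CJK characters are 50% larger in UTF-8 than in UTF-16.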
Why Byte Size Matters
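- Database storage — VARCHAR(255) in MySQL declares a length of 255 characters, not bytes; with utf8mb4 those characters can need up to 1,020 bytes, which counts against byte-based limits such as the maximum row size and index key length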
- API payloads — many APIs limit request/response body size in bytes, not characters
- File size estimation — predicting storage requirements for text data
- Network bandwidth — byte size determines transmission time
- Cookie limits — browsers typically limit each cookie to about 4,096 bytes, counting its name, value, and attributes together
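When a limit is expressed in bytes, a common follow-up task is trimming text to fit without cutting a multi-byte character in half. The following is a minimal sketch of one way to do that; `truncateToUtf8Bytes` is a hypothetical helper, and it splits on code points, so a multi-code-point grapheme (such as an emoji with a skin-tone modifier) may still be separated.

```javascript
// Returns the longest prefix of `text` whose UTF-8 encoding fits in `maxBytes`,
// never splitting a code point across the boundary.
function truncateToUtf8Bytes(text, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let result = "";
  for (const ch of text) { // iterate by code point, not by UTF-16 code unit
    const size = encoder.encode(ch).byteLength;
    if (used + size > maxBytes) break;
    used += size;
    result += ch;
  }
  return result;
}

console.log(truncateToUtf8Bytes("h\u00E9llo", 3)); // "hé" (1 + 2 bytes)
console.log(truncateToUtf8Bytes("\u4E16\u754C", 4)); // "世" (3 bytes; the next character would not fit)
```

Encoding one character at a time is simple rather than fast; for very long strings, encoding once and walking back from the byte limit to a character boundary would likely be more efficient.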
BOM (Byte Order Mark)
UTF-8 files sometimes start with a BOM (\xEF\xBB\xBF, 3 bytes). UTF-16 files use \xFF\xFE (little-endian) or \xFE\xFF (big-endian), 2 bytes each. These markers are invisible when the text is displayed but still count toward the file's byte size.
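To account for a BOM when measuring a file, the leading bytes can be inspected before counting. This is a sketch under stated assumptions: `detectBom` is a hypothetical helper that takes a Uint8Array of the file's bytes, and it checks only the UTF-8 and UTF-16 markers described above (a UTF-32 little-endian BOM also begins with FF FE and would be misreported).

```javascript
// Inspects the first bytes of a buffer and reports which BOM, if any, is present.
// Returns null when no UTF-8 or UTF-16 BOM is found.
function detectBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return { encoding: "UTF-8", length: 3 };
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return { encoding: "UTF-16LE", length: 2 };
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return { encoding: "UTF-16BE", length: 2 };
  }
  return null;
}

// A UTF-8 file containing "A" preceded by a BOM: 4 bytes on disk, 1 byte of text.
const fileBytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x41]);
const bom = detectBom(fileBytes);
console.log(fileBytes.length - (bom ? bom.length : 0)); // 1
```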
Use Case
Backend developers calculating database storage requirements use byte size to choose appropriate column types. Frontend developers building form validation need byte-aware limits for API fields. DevOps engineers estimating log storage costs and data engineers designing ETL pipelines that process text in specific encodings also rely on accurate byte size calculations.