Question 1

What is the difference between characters, code points, and grapheme clusters?

Accepted Answer

In JavaScript, .length returns the number of UTF-16 code units, not characters. A code point is a single Unicode value (e.g., U+1F600 for the grinning face emoji). A grapheme cluster is what a human perceives as a single character; it can consist of multiple code points (e.g., a flag emoji is two regional indicator code points). For plain ASCII text the three counts are almost always identical (a CRLF line break is the exception: two code points, one grapheme cluster), but for emoji, combining marks, and many non-Latin scripts they can differ significantly.
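The three counts can be compared side by side with a few lines of JavaScript. This is a minimal sketch, not necessarily how the tool itself is implemented; `countAll` is a hypothetical helper name, and the sketch assumes a runtime that provides Intl.Segmenter.

```javascript
// Count a string three ways: UTF-16 code units, code points, grapheme clusters.
// countAll is a hypothetical helper, not part of any library.
function countAll(text) {
  const codeUnits = text.length; // UTF-16 code units
  const codePoints = Array.from(text).length; // the string iterator steps by code point
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const graphemes = Array.from(segmenter.segment(text)).length;
  return { codeUnits, codePoints, graphemes };
}

const ascii = countAll("hello");
const flag = countAll("\u{1F1EF}\u{1F1F5}"); // regional indicators J + P (flag of Japan)

console.log(ascii); // { codeUnits: 5, codePoints: 5, graphemes: 5 }
console.log(flag); // { codeUnits: 4, codePoints: 2, graphemes: 1 }
```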

Question 2

Why does my emoji show different lengths for .length and code points?

Accepted Answer

Emoji above U+FFFF (like 😀) require two UTF-16 code units (a surrogate pair), so JavaScript's .length counts them as 2. Complex emoji sequences like family emojis use Zero Width Joiners (ZWJ) to combine multiple emoji, resulting in many code units but just one visual grapheme. The code point count in this tool tells you how many Unicode values the string contains, and the grapheme cluster count tells you how many characters a reader actually sees; both are usually more useful than .length.
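To see where the extra length comes from, it helps to list each code point together with the number of code units it occupies. The sketch below assumes nothing beyond standard JavaScript string methods; `describeCodePoints` is a hypothetical helper name used only for illustration.

```javascript
// Break a string into code points and show how many UTF-16 code units each one needs.
// describeCodePoints is a hypothetical helper, not part of the tool.
function describeCodePoints(text) {
  return Array.from(text, (ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    codeUnits: ch.length, // 1 inside the BMP, 2 for a surrogate pair
  }));
}

const grinning = "\u{1F600}";
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man, ZWJ, woman, ZWJ, girl

console.log(grinning.length, describeCodePoints(grinning).length); // 2 1
console.log(family.length, describeCodePoints(family).length); // 8 5
console.log(describeCodePoints(family).map((p) => p.codePoint).join(" "));
// U+1F468 U+200D U+1F469 U+200D U+1F467
```

The family sequence is three emoji of two code units each plus two joiners of one code unit each, which is why .length reports 8 for something that renders as a single glyph.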

Question 3

Which count should I use for database VARCHAR limits?

Accepted Answer

It depends on your database and encoding. PostgreSQL VARCHAR(n) counts characters (code points). MySQL VARCHAR(n) with utf8mb4 also counts characters, although the row as a whole is still subject to a byte-based size limit. MySQL's TEXT type limits, however, are in bytes (65,535 bytes for a plain TEXT column). For byte-limited columns, use the UTF-8 byte count. Always check your database documentation to confirm whether a limit is expressed in characters or bytes.
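A client-side pre-check against either kind of limit might look like the sketch below. `fitsColumn` and its `unit` parameter are hypothetical names chosen for illustration, and the check assumes the column stores UTF-8; the database remains the final authority on what it accepts.

```javascript
// Check a string against a column limit expressed in characters or in bytes.
// fitsColumn is a hypothetical helper; it assumes the column stores UTF-8.
function fitsColumn(text, limit, unit) {
  const used =
    unit === "bytes"
      ? new TextEncoder().encode(text).length // UTF-8 byte count
      : Array.from(text).length; // code point count
  return { used, fits: used <= limit };
}

const sample = "h\u00E9llo \u{1F600}"; // "héllo " followed by a grinning face

console.log(fitsColumn(sample, 10, "characters")); // { used: 7, fits: true }
console.log(fitsColumn(sample, 10, "bytes")); // { used: 11, fits: false }
```

The same string passes a 10-character limit but fails a 10-byte limit, because é takes 2 bytes and the emoji takes 4 in UTF-8.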

Question 4

How is the grapheme cluster count calculated?

Accepted Answer

This tool uses the Intl.Segmenter API (available in modern browsers) with grapheme granularity. This correctly handles complex emoji sequences, combining marks, and other multi-code-point graphemes according to the Unicode text segmentation rules (UAX #29). In older browsers without Intl.Segmenter, it falls back to splitting by code points, which overcounts anything built from several code points, such as ZWJ emoji sequences, flags, and letters with combining marks.
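The approach described above can be sketched as follows. This is an illustration of the technique rather than the tool's actual source; the function names `countGraphemes` and `countByCodePoints` are assumptions.

```javascript
// Fallback used when Intl.Segmenter is missing: one "grapheme" per code point.
// Both function names here are hypothetical.
function countByCodePoints(text) {
  return Array.from(text).length;
}

// Count grapheme clusters with Intl.Segmenter when it is available.
function countGraphemes(text) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(text)) {
      count += 1;
    }
    return count;
  }
  return countByCodePoints(text); // less accurate, see the answer above
}

const accented = "e\u0301"; // "e" followed by a combining acute accent

console.log(countGraphemes(accented)); // 1 where Intl.Segmenter exists
console.log(countByCodePoints(accented)); // 2
```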

Question 5

What are surrogate pairs?

Accepted Answer

UTF-16 uses 2 bytes per code unit. Characters with code points above U+FFFF (like most emoji and some CJK characters) cannot fit in a single 16-bit code unit, so they are encoded as a pair of code units called a surrogate pair: a high surrogate in the range U+D800 to U+DBFF followed by a low surrogate in the range U+DC00 to U+DFFF. This is why JavaScript's .length returns 2 for a single emoji such as 😀. The tool highlights surrogate pairs in orange in the grapheme breakdown.
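The pair can be computed by hand from the code point using the standard UTF-16 arithmetic. The sketch below uses a hypothetical helper name, `toSurrogatePair`, and compares its result with what JavaScript stores for the same character.

```javascript
// Split a code point above U+FFFF into its UTF-16 high and low surrogates.
// toSurrogatePair is a hypothetical helper written for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only code points U+10000..U+10FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// The same two code units are what the string actually contains.
const emoji = "\u{1F600}";
console.log(emoji.charCodeAt(0) === high, emoji.charCodeAt(1) === low); // true true
```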

Question 6

Is my data safe?

Accepted Answer

Yes. All processing runs entirely in your browser using JavaScript. No text is sent to any server. You can verify this by checking the Network tab in your browser's developer tools while using the tool.

Question 7

Why does the same text have different byte sizes in UTF-8 and UTF-16?

Accepted Answer

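UTF-8 and UTF-16 are variable-length encodings with different strategies. UTF-8 uses 1 byte for ASCII, 2 bytes for code points up to U+07FF (most European and Middle Eastern alphabets), 3 bytes for the rest of the Basic Multilingual Plane (including most CJK characters), and 4 bytes for code points above U+FFFF (including most emoji). UTF-16 uses 2 bytes for every code point up to U+FFFF and 4 bytes (a surrogate pair) for code points above U+FFFF. For English text, UTF-8 is more compact. For CJK text, UTF-16 is often smaller (2 bytes per character instead of 3). UTF-32 always uses 4 bytes per code point, regardless of the character.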
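These sizes can be reproduced with a short sketch. `byteSizes` is a hypothetical helper name; the figures exclude any byte order mark, and the sketch assumes well-formed input (TextEncoder substitutes U+FFFD for lone surrogates, which would change the UTF-8 count).

```javascript
// Encoded size of a string in bytes for UTF-8, UTF-16, and UTF-32 (no BOM).
// byteSizes is a hypothetical helper written for illustration.
function byteSizes(text) {
  const utf8 = new TextEncoder().encode(text).length; // TextEncoder always emits UTF-8
  const utf16 = text.length * 2; // 2 bytes per UTF-16 code unit
  const utf32 = Array.from(text).length * 4; // 4 bytes per code point
  return { utf8, utf16, utf32 };
}

const english = byteSizes("hello");
const japanese = byteSizes("\u65E5\u672C\u8A9E"); // "日本語"
const emojiSizes = byteSizes("\u{1F600}");

console.log(english); // { utf8: 5, utf16: 10, utf32: 20 }
console.log(japanese); // { utf8: 9, utf16: 6, utf32: 12 }
console.log(emojiSizes); // { utf8: 4, utf16: 4, utf32: 4 }
```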
