Encoding Unicode/UTF-8 Text as Base64

Handle Unicode and UTF-8 characters correctly when Base64 encoding. Learn the UTF-8-first approach, browser quirks with btoa(), and cross-platform solutions.

Encoding

Detailed Explanation

Base64 operates on bytes, not characters. When your text contains characters outside the ASCII range (accented letters, CJK characters, emoji, mathematical symbols), you must first convert the text to a byte sequence using a character encoding -- almost always UTF-8 -- before Base64 encoding.

The problem with btoa():

The browser's btoa() function accepts a string but treats each character as a single byte (Latin-1 encoding). Characters with code points above 255 cause an InvalidCharacterError:

btoa("Hello");     // Works: "SGVsbG8="
btoa("café");    // Error! ́ is above Latin-1
btoa("😀");      // Error! Emoji code point is way above 255

The solution: encode to UTF-8 first

Modern approach using TextEncoder:

function encodeUnicode(str) {
  const utf8Bytes = new TextEncoder().encode(str);
  const binaryString = Array.from(utf8Bytes, b => String.fromCharCode(b)).join("");
  return btoa(binaryString);
}

function decodeUnicode(base64) {
  const binaryString = atob(base64);
  const bytes = Uint8Array.from(binaryString, c => c.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

encodeUnicode("Hello 😀"); // "SGVsbG8g8J+YgA=="
decodeUnicode("SGVsbG8g8J+YgA=="); // "Hello 😀"

Legacy approach (still widely used):

function utoa(str) {
  return btoa(unescape(encodeURIComponent(str)));
}
function atou(b64) {
  return decodeURIComponent(escape(atob(b64)));
}

This works because encodeURIComponent() converts the string to UTF-8 percent-encoded bytes, and unescape() converts those percent sequences back to single-byte characters that btoa() can handle.

In Python, it just works:

import base64
text = "Hello 😀"
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

Python naturally separates the string-to-bytes step (.encode("utf-8")) from the Base64 step (b64encode()), making the process explicit and less error-prone.

Cross-platform interoperability: The critical rule is that both the encoder and decoder must agree on the character encoding used before Base64. If you encode with UTF-8 on the client but decode assuming Latin-1 on the server, multi-byte characters will be corrupted. UTF-8 is the universal default; always use it unless you have a specific reason not to.

Common mistake: Assuming Base64 handles Unicode natively. Base64 knows nothing about character encodings. It processes raw bytes. The character encoding step must happen before encoding and after decoding.

Use Case

Encoding user-submitted form data containing international characters (Chinese, Arabic, emoji) as Base64 for inclusion in a URL query parameter.

Try It — Base64 Encoder

Open full tool →

Encoding Unicode/UTF-8 Text as Base64

Detailed Explanation

Use Case

Try It — Base64 Encoder

Related Topics