How to Strip and Remove Zalgo from Text
Learn techniques to programmatically remove Zalgo combining marks from text using regex ranges and Unicode categories, with examples in JavaScript and Python.
Detailed Explanation
Removing Zalgo Text
Stripping Zalgo means removing all combining diacritical marks from text, restoring it to its clean, readable form. This is essential for content moderation, text processing, and data cleaning.
JavaScript Regex Approach
A common approach uses a regex that matches the Combining Diacritical Marks block (U+0300 to U+036F):
function stripZalgo(text) {
  return text.replace(/[\u0300-\u036f]/g, '');
}

// For broader coverage (extended combining marks):
function stripZalgoFull(text) {
  return text.replace(
    /[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g,
    ''
  );
}
Python
import unicodedata

def strip_zalgo(text):
    return ''.join(
        c for c in text
        if unicodedata.category(c) != 'Mn'  # Mn = Mark, Nonspacing
    )
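When using a category filter like this, it helps to see how it treats accented letters. The sketch below is illustrative only: the name `strip_nonspacing` and the sample word are assumptions, not part of the original. It compares a precomposed "é" (one code point, U+00E9) with a decomposed one ("e" followed by U+0301).

```python
import unicodedata


def strip_nonspacing(text: str) -> str:
    """Drop every code point whose general category is Mn."""
    return "".join(c for c in text if unicodedata.category(c) != "Mn")


# The same word in two normalization forms.
precomposed = unicodedata.normalize("NFC", "cafe\u0301")  # "é" is U+00E9
decomposed = unicodedata.normalize("NFD", "caf\u00e9")    # "e" + U+0301

print(len(precomposed), len(decomposed))  # 4 5
print(strip_nonspacing(precomposed))      # café (accent survives)
print(strip_nonspacing(decomposed))       # cafe (accent stripped)
```

Whether accents survive therefore depends on the normalization form of the input, which is worth deciding before stripping.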
Using Unicode Categories
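The Unicode General Category Mn (Mark, Nonspacing) covers the nonspacing combining marks that Zalgo text is built from, including those outside the Combining Diacritical Marks blocks. It does not cover the other two mark categories, Me (Mark, Enclosing) and Mc (Mark, Spacing Combining); in a regex, \p{M} matches all three. Matching on the category is more robust than listing ranges, because it does not depend on specific code point ranges: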
// Using Unicode property escapes (modern JS):
function stripZalgo(text) {
  return text.replace(/\p{Mn}/gu, '');
}
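The same category test can be sketched in Python with a configurable set of categories. The helper name `strip_marks` and the default set are assumptions for illustration: it removes Mn and Me (enclosing marks such as U+20DD COMBINING ENCLOSING CIRCLE) but leaves Mc alone, since spacing marks are ordinary vowel signs in scripts such as Devanagari.

```python
import unicodedata

# Assumption: strip nonspacing (Mn) and enclosing (Me) marks, keep spacing marks (Mc).
STRIPPED_CATEGORIES = frozenset({"Mn", "Me"})


def strip_marks(text: str, categories=STRIPPED_CATEGORIES) -> str:
    """Remove code points whose general category is in `categories`."""
    return "".join(
        c for c in text
        if unicodedata.category(c) not in categories
    )


print(strip_marks("Z\u0351a\u0300l\u20ddgo"))
```

Passing `categories={"Mn"}` reproduces the behavior of the `\p{Mn}` regex.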
Preserving Legitimate Diacritics
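A challenge: stripping all combining marks also removes legitimate ones. Accented letters such as é, ñ, and ü lose their accents when they are stored in decomposed form (a base letter followed by a combining mark, as in NFD); precomposed forms such as U+00E9 are single code points and pass through unchanged. Scripts such as Arabic, Hebrew, and Thai also rely on nonspacing marks. To preserve legitimate diacritics while removing the excess: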
function stripExcessCombining(text, maxPerChar = 2) {
  let result = '';
  let combiningCount = 0;
  for (const char of text) {
    if (/\p{Mn}/u.test(char)) {
      combiningCount++;
      if (combiningCount <= maxPerChar) result += char;
    } else {
      combiningCount = 0;
      result += char;
    }
  }
  return result;
}
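This caps the number of combining marks kept after each base character (two by default), preserving normally accented text while removing the Zalgo excess. The counter resets at every non-mark character, so the cap applies per base character rather than to the whole string. A cap of two accommodates decomposed text in languages that stack two marks on one letter, such as Vietnamese.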
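A rough Python equivalent of the capping approach is sketched below. The name `strip_excess_combining` and the parameter `max_per_char` mirror the JavaScript version but are otherwise assumptions. The input is deliberately not renormalized first, because normalization can reorder marks and change which ones fall under the cap.

```python
import unicodedata


def strip_excess_combining(text: str, max_per_char: int = 2) -> str:
    """Keep at most `max_per_char` nonspacing marks after each base character."""
    kept = []
    combining_count = 0
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            combining_count += 1
            if combining_count <= max_per_char:
                kept.append(ch)
        else:
            # Any non-mark character starts a new cluster.
            combining_count = 0
            kept.append(ch)
    return "".join(kept)


zalgo = "e\u0301\u0316\u0316\u0316"  # "e" + acute accent + three extra marks
print(len(zalgo), len(strip_excess_combining(zalgo)))  # 5 3
```

Setting `max_per_char=0` turns this into a plain strip of all Mn marks.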
Use Case
Stripping Zalgo is essential for content moderation systems, chat applications, forum software, and any text processing pipeline that needs to handle user-generated content that may contain malicious or disruptive Unicode combining marks.