Regex for Matching and Extracting HTML Tags
Regex patterns for matching HTML tags, extracting tag names and attributes, and stripping HTML. Includes important caveats about parsing HTML with regex.
Detailed Explanation
HTML Tag Matching with Regex
While regex should not be used as a full HTML parser, it is useful for simple tag matching, extraction, and sanitization tasks.
Match Any HTML Tag
<[^>]+>
This matches any HTML tag (opening, closing, or self-closing):
<div>, </div>, <br />, <img src="..." />
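As a minimal sketch of the pattern in use, the snippet below collects every tag-like token from a string. The helper name `findTags` is hypothetical and chosen only for this illustration.

```javascript
// Sketch: list every tag-like token in a string.
// `findTags` is a hypothetical helper name, not part of any library.
const ANY_TAG = /<[^>]+>/g;

function findTags(html) {
  // match() with the g flag returns all matches, or null when there are none
  return html.match(ANY_TAG) || [];
}

console.log(findTags('<div>Hi<br /></div>'));
// [ '<div>', '<br />', '</div>' ]
```

Because the pattern only looks for `<`, some non-`>` characters, and `>`, it also picks up comments and doctype declarations that fit that shape.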
Match Specific Tags
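<(?:p|div|span)\b[^>]*>
Matches opening tags for specific elements. Use alternation inside a non-capturing group, followed by a word boundary (\b). Without the \b, the p alternative would also match the start of longer names such as <pre> or <param>.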
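The following sketch applies this kind of pattern to a small string. It places a `\b` after the group so that `p` does not match the beginning of `<pre>`; the helper name `findOpeningTags` is an assumption made for the example.

```javascript
// Sketch: find opening tags for a fixed set of element names.
// The \b stops `p` from matching the start of longer names such as `pre`.
const SPECIFIC_OPEN_TAG = /<(?:p|div|span)\b[^>]*>/g;

// Hypothetical helper name, used only for this illustration.
function findOpeningTags(html) {
  return html.match(SPECIFIC_OPEN_TAG) || [];
}

console.log(findOpeningTags('<div class="a"><pre>x</pre><p>y</p></div>'));
// [ '<div class="a">', '<p>' ]
```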
Extract Tag Name and Attributes
<(?<tag>\w+)(?<attrs>[^>]*)>
Groups:
- tag: the element name
- attrs: all attributes as a raw string
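A short sketch of reading the two named groups in JavaScript follows. It assumes a runtime with named capture group support (ES2018 or later), and `parseOpeningTag` is a hypothetical helper name.

```javascript
// Sketch: pull the tag name and raw attribute string out of the first opening tag.
const OPEN_TAG = /<(?<tag>\w+)(?<attrs>[^>]*)>/;

// Hypothetical helper name, used only for this illustration.
function parseOpeningTag(html) {
  const match = OPEN_TAG.exec(html);
  if (!match) return null;
  const { tag, attrs } = match.groups;
  // attrs keeps its leading whitespace, so trim it for readability
  return { tag, attrs: attrs.trim() };
}

console.log(parseOpeningTag('<a href="/docs" class="link">Docs</a>'));
// { tag: 'a', attrs: 'href="/docs" class="link"' }
```

Since the pattern requires a word character right after `<`, closing tags such as `</a>` are skipped.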
Extract Individual Attributes
(?<attr>\w+)=(?:"(?<val>[^"]*)"|'(?<val2>[^']*)')
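Handles both double-quoted and single-quoted attribute values; the value lands in val or val2 depending on the quote style. Unquoted values (type=text) and valueless attributes (disabled) are not matched, and because \w does not include hyphens, a name such as data-id is captured only as id.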
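One way to turn the matches into a plain object is sketched below, using `String.prototype.matchAll`. The helper name `parseAttributes` is hypothetical, and the input is assumed to be a raw attribute string with quoted values.

```javascript
// Sketch: collect quoted attributes into a plain object.
const ATTR = /(?<attr>\w+)=(?:"(?<val>[^"]*)"|'(?<val2>[^']*)')/g;

// Hypothetical helper name, used only for this illustration.
function parseAttributes(attrString) {
  const result = {};
  for (const m of attrString.matchAll(ATTR)) {
    const { attr, val, val2 } = m.groups;
    // Exactly one of val / val2 is defined, depending on the quote style
    result[attr] = val !== undefined ? val : val2;
  }
  return result;
}

console.log(parseAttributes(`href="/docs" title='Read more'`));
// { href: '/docs', title: 'Read more' }
```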
Strip All HTML Tags
str.replace(/<[^>]+>/g, "")
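Removes all HTML tags, leaving only text content. Note: this does not handle all edge cases. A > inside an attribute value ends the match early and leaves part of the tag behind, and the text inside <script> and <style> elements is kept because only the tags themselves are removed.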
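The sketch below wraps the replacement in a function and shows both a clean case and a case where a `>` inside an attribute value trips the pattern. The name `stripTags` is hypothetical.

```javascript
// Sketch: remove anything that looks like a tag.
// `stripTags` is a hypothetical helper name.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, "");
}

console.log(stripTags('<p>Hello <b>world</b>!</p>'));
// Hello world!

// Edge case: a `>` inside an attribute value ends the match early.
const leaky = stripTags('<a title="a > b">link</a>');
// leaky is the string ' b">link': part of the attribute value leaks through
console.log(leaky);
```

Character references such as `&amp;` are left untouched, so a plain-text preview may still need a decoding step.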
Match Self-Closing Tags
<\w+[^>]*/>
Matches tags like <br />, <img src="..." />, <input type="text" />.
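A small sketch of this pattern in JavaScript follows. In a regex literal the `/` has to be escaped as `\/`; the helper name `findSelfClosing` is an assumption made for the example.

```javascript
// Sketch: find tags written in self-closing form.
// The slash is escaped because `/` delimits a JavaScript regex literal.
const SELF_CLOSING = /<\w+[^>]*\/>/g;

// Hypothetical helper name, used only for this illustration.
function findSelfClosing(html) {
  return html.match(SELF_CLOSING) || [];
}

console.log(findSelfClosing('<p>Line<br />next <img src="a.png" /></p><hr>'));
// [ '<br />', '<img src="a.png" />' ]
```

Void elements written without the trailing slash, such as the `<hr>` above, are not matched.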
Why Not Parse HTML with Regex
- HTML is not a regular language; it has recursive nesting
- Regex cannot handle nested tags correctly
- Malformed HTML breaks regex patterns
- Use a DOM parser (DOMParser, cheerio, jsdom) for reliable HTML processing
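The nesting problem from the list above can be seen in a few lines. The pattern here is a hypothetical one written for this demonstration: it tries to capture the content of a `<div>` and stops at the first closing tag it finds, not the matching one.

```javascript
// Hypothetical pattern for illustration: capture the content of a <div>.
const DIV_CONTENT = /<div>(.*?)<\/div>/;

// Hypothetical helper name, used only for this illustration.
function firstDivContent(html) {
  const match = DIV_CONTENT.exec(html);
  return match ? match[1] : null;
}

console.log(firstDivContent('<div>outer <div>inner</div> tail</div>'));
// outer <div>inner
```

The lazy quantifier pairs the outer opening tag with the inner closing tag, so the captured content is cut short. A regex has no way to count how many `<div>` elements are currently open.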
For simple extraction or tag stripping on HTML whose format you know and control, regex is usually adequate. For complex parsing, or for sanitizing untrusted input, use a proper parser.
Use Case
You need to strip HTML tags from user input for plain-text preview, extract specific elements from a known HTML structure, or sanitize HTML by removing unwanted tags while preserving content.
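For the last case, removing selected tags while keeping their content, a possible sketch is shown below. The function `unwrapTags` and its parameters are hypothetical, and the pattern is assembled from the earlier ones rather than taken from the text above. It is meant for tidying known markup, not as a security filter.

```javascript
// Sketch: drop the opening and closing tags of selected elements, keep their content.
// `unwrapTags` and its parameters are hypothetical names for this illustration.
function unwrapTags(html, tagNames) {
  // Assumes tagNames are plain element names (letters and digits only),
  // so they can be joined into the pattern without escaping.
  const pattern = new RegExp(`</?(?:${tagNames.join("|")})\\b[^>]*>`, "gi");
  return html.replace(pattern, "");
}

console.log(
  unwrapTags('<p>Keep <span class="x">this</span> <font>text</font></p>', ["span", "font"])
);
// <p>Keep this text</p>
```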