Regex to Extract HTML Tags and Attributes
Regex patterns to extract HTML tag names, attributes, and content. Useful for link extraction, quick audits, and templating tasks where a full DOM parser is overkill.
Detailed Explanation
Extracting HTML Tags
Regex cannot parse arbitrary HTML, but for predictable, well-formed snippets it is fine for tag and attribute extraction. For real-world HTML, prefer a parser such as cheerio or jsdom (Node.js) or the browser's built-in DOMParser.
Match Any Tag
<\/?[a-zA-Z][\w-]*(?:\s[^>]*)?\/?>
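A minimal sketch of how this pattern might be applied in JavaScript. The helper name `findTags` and the sample string are assumptions made for illustration; the `g` flag is added so that `matchAll` can iterate over every match.

```javascript
// Pattern from above, with the g flag so every tag in the input is visited.
const ANY_TAG = /<\/?[a-zA-Z][\w-]*(?:\s[^>]*)?\/?>/g;

// Hypothetical helper: returns every tag-like substring, in document order.
function findTags(html) {
  return Array.from(html.matchAll(ANY_TAG), (m) => m[0]);
}

const sample = '<p class="note">Hi <br/> there</p>';
// Logs the opening tag, the self-closing tag, and the closing tag.
console.log(findTags(sample));
```

Text between tags is skipped, so the result holds only the tags themselves.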
Capture Tag Name
<\/?(?<tag>[a-zA-Z][\w-]*)
In <a href="...">, captures a.
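As a sketch, the named group can be read through `match.groups`. The helper `tagNames` is hypothetical, and lowercasing the result is an added assumption (HTML tag names are case-insensitive), not part of the pattern.

```javascript
// Pattern from above; the named group "tag" holds the tag name.
const TAG_NAME = /<\/?(?<tag>[a-zA-Z][\w-]*)/g;

// Hypothetical helper: lists tag names for opening and closing tags alike.
function tagNames(html) {
  return Array.from(html.matchAll(TAG_NAME), (m) => m.groups.tag.toLowerCase());
}

// Opening and closing tags both contribute a name.
console.log(tagNames('<A href="...">x</A><my-widget>'));
```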
Extract All Attributes
(?<name>[\w-]+)(?:\s*=\s*(?:"(?<dq>[^"]*)"|'(?<sq>[^']*)'|(?<bare>[^\s>]+)))?
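Run with the g flag against the attribute portion of a tag, that is, the text after the tag name; otherwise the tag name itself matches as an attribute. Each match captures the attribute name and, when a value is present, the value in exactly one of dq (double-quoted), sq (single-quoted), or bare (unquoted). Valueless attributes such as disabled leave all three groups unset.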
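One possible way to turn the matches into an object, sketched below. The helper `parseAttributes`, the step that strips the tag name, and the choice to store valueless attributes as `true` are all assumptions for illustration.

```javascript
// Attribute pattern from above, with the g flag.
const ATTR = /(?<name>[\w-]+)(?:\s*=\s*(?:"(?<dq>[^"]*)"|'(?<sq>[^']*)'|(?<bare>[^\s>]+)))?/g;

// Hypothetical helper: maps attribute names to values for a single tag.
function parseAttributes(tag) {
  // Drop the leading "<tagname" and trailing ">" or "/>" so the tag name
  // is not mistaken for an attribute.
  const body = tag.replace(/^<\/?[a-zA-Z][\w-]*/, "").replace(/\/?>$/, "");
  const attrs = {};
  for (const m of body.matchAll(ATTR)) {
    const { name, dq, sq, bare } = m.groups;
    // At most one of dq/sq/bare is set; none is set for valueless attributes.
    attrs[name] = dq ?? sq ?? bare ?? true;
  }
  return attrs;
}

console.log(parseAttributes("<input disabled type='text' data-id=42>"));
```

Values are kept as raw strings; entity decoding (for example `&amp;`) is out of scope for this sketch.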
Specific Tag with Inner Content
To extract <a> link text and href:
<a\s+[^>]*href=["'](?<href>[^"']+)["'][^>]*>(?<text>.*?)<\/a>
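A rough sketch of running the link pattern over a snippet. The helper `extractLinks` is hypothetical. The `s` (dotAll) flag is an addition so that link text spanning several lines still matches; without it, `.` stops at a newline. Note that `[^>]*href=` would also accept an attribute such as `data-href`, so this suits only markup whose shape is known in advance.

```javascript
// Link pattern from above, with g (all matches) and s (dot matches newlines).
const LINK = /<a\s+[^>]*href=["'](?<href>[^"']+)["'][^>]*>(?<text>.*?)<\/a>/gs;

// Hypothetical helper: returns { href, text } pairs for every <a> element found.
function extractLinks(html) {
  return Array.from(html.matchAll(LINK), (m) => ({
    href: m.groups.href,
    text: m.groups.text,
  }));
}

const snippet =
  '<p><a href="/x">click</a> and <a class="ext" href=\'https://example.com\'>site</a></p>';
console.log(extractLinks(snippet));
```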
Tested Examples
| Input | Tag | Attributes |
|---|---|---|
| `<div class="box">` | div | class=box |
| `<img src="a.png" alt="A">` | img | src=a.png, alt=A |
| `<input disabled type='text'>` | input | disabled (no value), type=text |
| `<a href="/x">click</a>` | a, /a | href=/x |
Self-Closing Tags
<[a-zA-Z][\w-]*(?:\s[^>]*)?\/>
Matches <br/>, <img src="a"/>, <hr />.
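A short sketch that filters a mixed snippet down to its self-closing tags. The helper name `selfClosingTags` and the sample input are assumptions for illustration.

```javascript
// Self-closing pattern from above, with the g flag.
const SELF_CLOSING = /<[a-zA-Z][\w-]*(?:\s[^>]*)?\/>/g;

// Hypothetical helper: String.prototype.match returns null when nothing
// matches, so fall back to an empty array.
function selfClosingTags(html) {
  return html.match(SELF_CLOSING) ?? [];
}

// <p>, </p>, and the void-style <br> are skipped because they lack "/>".
console.log(selfClosingTags('<br/><img src="a"/><hr /><p>text</p><br>'));
```

An unquoted attribute value that ends in a slash, such as `<a href=x/>`, would also be reported, since the pattern only looks at the characters right before `>`.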
Limitations
- Attribute values containing `>` break naive patterns
- Nested tags (`<div><div></div></div>`) cannot be matched correctly
- HTML comments `<!-- ... -->` need a separate pattern
- Script/style tag content can include `<` and confuse patterns
For sanitization, use DOMPurify rather than regex.
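Two of these limitations can be reproduced in a few lines. The sketch below uses the any-tag pattern as the naive case. `QUOTE_AWARE_TAG` and `COMMENT` are assumed variants written for this illustration, not patterns from this page, and neither makes regex a safe substitute for a parser.

```javascript
const NAIVE_TAG = /<\/?[a-zA-Z][\w-]*(?:\s[^>]*)?\/?>/g;
// Assumed variant: quoted attribute values are consumed as whole units,
// so a ">" inside quotes no longer ends the tag.
const QUOTE_AWARE_TAG = /<\/?[a-zA-Z][\w-]*(?:\s(?:"[^"]*"|'[^']*'|[^>"'])*)?\/?>/g;
// Assumed separate pattern for comments; [\s\S] also spans newlines.
const COMMENT = /<!--[\s\S]*?-->/g;

const tricky = '<img alt="a > b" src="x.png">';
const naiveHits = tricky.match(NAIVE_TAG) ?? [];
const quoteAwareHits = tricky.match(QUOTE_AWARE_TAG) ?? [];
console.log(naiveHits);      // match is cut short at the ">" inside alt
console.log(quoteAwareHits); // the whole tag

const commented = "<!-- <b>old</b> --><i>new</i>";
const insideComment = commented.match(NAIVE_TAG) ?? [];
const afterStripping = commented.replace(COMMENT, "").match(NAIVE_TAG) ?? [];
console.log(insideComment);  // includes <b> and </b> from inside the comment
console.log(afterStripping); // only <i> and </i>
```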
Use Case
Extracting all `<a href>` URLs from rendered Markdown, finding `<img>` tags missing `alt` attributes during accessibility audits, or pulling specific tag content from email templates.
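The accessibility audit could be sketched roughly as follows. The helper `imagesMissingAlt` and both pattern constants are assumptions built from the tag and attribute patterns on this page. An empty `alt=""` counts as present here, since it is the usual way to mark decorative images.

```javascript
// <img> tags only; the i flag also accepts <IMG>.
const IMG = /<img(?:\s[^>]*)?\/?>/gi;
// Attribute pattern with the value groups left unnamed; consuming the value
// keeps text inside quotes from being read as an attribute name.
const ATTR_NAME = /(?<name>[\w-]+)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?/g;

// Hypothetical helper: returns the <img> tags that carry no alt attribute.
function imagesMissingAlt(html) {
  const missing = [];
  for (const [tag] of html.matchAll(IMG)) {
    // Remove "<img" and the closing ">" or "/>" before scanning attributes.
    const body = tag.replace(/^<img/i, "").replace(/\/?>$/, "");
    const names = Array.from(body.matchAll(ATTR_NAME), (m) => m.groups.name.toLowerCase());
    if (!names.includes("alt")) missing.push(tag);
  }
  return missing;
}

const page = '<img src="a.png" alt="A"><img src="b.png"><img src="c.png" title="alt">';
// The second and third tags are reported; title="alt" is a value, not an alt attribute.
console.log(imagesMissingAlt(page));
```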