Regex to Extract HTML Tags and Attributes

Regex patterns to extract HTML tag names, attributes, and content. Useful for link extraction, accessibility audits, and templating tasks where a full DOM parser is overkill.

Detailed Explanation

Extracting HTML Tags

Regex cannot parse arbitrary HTML, but it works well enough for tag and attribute extraction on predictable, well-formed snippets. For real-world HTML, prefer a parser: cheerio or jsdom in Node, or the browser's built-in DOMParser.

Match Any Tag

<\/?[a-zA-Z][\w-]*(?:\s[^>]*)?\/?>

Capture Tag Name

<\/?(?<tag>[a-zA-Z][\w-]*)

In <a href="...">, captures a.
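
A minimal sketch of how the tag-name pattern might be applied in JavaScript. The helper name tagNames and the sample markup are illustrative assumptions, and the g flag is added here only because matchAll requires it.

```javascript
// Tag-name pattern from above, with the g flag added (required by matchAll).
const TAG_NAME = /<\/?(?<tag>[a-zA-Z][\w-]*)/g;

// Hypothetical helper: returns every tag name in document order, lowercased.
// Opening and closing tags are both reported, so paired tags appear twice.
function tagNames(html) {
  return Array.from(html.matchAll(TAG_NAME), (m) => m.groups.tag.toLowerCase());
}

console.log(tagNames('<ul><li>One</li><my-item id="x"/></ul>'));
// [ 'ul', 'li', 'li', 'my-item', 'ul' ]
```

Because the pattern only looks at the text after <, a literal < followed by a letter in ordinary text would also be reported as a tag.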

Extract All Attributes

(?<name>[\w-]+)(?:\s*=\s*(?:"(?<dq>[^"]*)"|'(?<sq>[^']*)'|(?<bare>[^\s>]+)))?

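Run with the g flag against the attribute portion of a tag (the text after the tag name); otherwise the tag name itself is reported as a valueless attribute. Each match captures the attribute name and, when a value is present, the value in exactly one of dq, sq, or bare. Boolean attributes such as disabled match with none of the value groups set.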
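
The following sketch shows one way the attribute pattern could be used. The function name parseAttributes and the choice to represent valueless attributes as true are assumptions made for illustration, not part of the pattern itself.

```javascript
// Attribute pattern from above, with the g flag so every attribute is visited.
const ATTR = /(?<name>[\w-]+)(?:\s*=\s*(?:"(?<dq>[^"]*)"|'(?<sq>[^']*)'|(?<bare>[^\s>]+)))?/g;

// Hypothetical helper: turns one opening tag into a { name: value } object.
function parseAttributes(openTag) {
  // Strip the leading "<tagname" so the tag name is not reported as an attribute.
  const attrText = openTag.replace(/^<\/?[a-zA-Z][\w-]*/, "");
  const attrs = {};
  for (const m of attrText.matchAll(ATTR)) {
    const { name, dq, sq, bare } = m.groups;
    // At most one of dq/sq/bare is set; valueless attributes are stored as true.
    attrs[name] = dq ?? sq ?? bare ?? true;
  }
  return attrs;
}

console.log(parseAttributes("<input disabled type='text'>"));
// { disabled: true, type: 'text' }
```

Using ?? rather than || keeps an empty quoted value such as alt="" as an empty string instead of collapsing it to true.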

Specific Tag with Inner Content

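To extract <a> link text and href (this assumes a quoted href value and, unless the s flag is set, link text that does not span lines):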

<a\s+[^>]*href=["'](?<href>[^"']+)["'][^>]*>(?<text>.*?)<\/a>
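
As a rough sketch, the link pattern can be combined with matchAll to collect every link in a snippet. The g and s flags, the helper name extractLinks, and the sample markup are assumptions added for this example; the s flag lets . match newlines inside the link text.

```javascript
// Link pattern from above; g visits every link, s lets the text span lines.
const LINK = /<a\s+[^>]*href=["'](?<href>[^"']+)["'][^>]*>(?<text>.*?)<\/a>/gs;

// Hypothetical helper: returns { href, text } for each <a> with a quoted href.
function extractLinks(html) {
  return Array.from(html.matchAll(LINK), (m) => ({
    href: m.groups.href,
    text: m.groups.text,
  }));
}

const sample =
  `<p><a href="/x">click</a> and ` +
  `<a class="ext" href='https://example.com'>Example\nsite</a></p>`;

for (const { href, text } of extractLinks(sample)) {
  console.log(href, JSON.stringify(text));
}
```

Links with an unquoted href, or with a quote character of the other kind inside the value, are skipped or truncated by this pattern, which is one reason a parser is the safer choice for untrusted markup.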

Tested Examples

| Input | Tag | Attributes |
| --- | --- | --- |
| <div class="box"> | div | class=box |
| <img src="a.png" alt="A"> | img | src=a.png, alt=A |
| <input disabled type='text'> | input | disabled (no value), type=text |
| <a href="/x">click</a> | a, /a | href=/x |

Self-Closing Tags

<[a-zA-Z][\w-]*(?:\s[^>]*)?\/>

Matches <br/>, <img src="a"/>, <hr />.
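
A short sketch of the self-closing pattern in use; the helper name selfClosingTags and the sample string are illustrative assumptions, and the g flag is added so that match returns every occurrence.

```javascript
// Self-closing pattern from above, with the g flag to collect all matches.
const SELF_CLOSING = /<[a-zA-Z][\w-]*(?:\s[^>]*)?\/>/g;

// Hypothetical helper: returns every self-closing tag found, or an empty array.
function selfClosingTags(html) {
  return html.match(SELF_CLOSING) ?? [];
}

console.log(selfClosingTags('<p>a<br/>b<img src="a"/><hr /></p><input>'));
// [ '<br/>', '<img src="a"/>', '<hr />' ]
```

Void elements written without the trailing slash, such as <input> in the sample, are not matched.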

Limitations

  • Attribute values containing > break naive patterns
  • Nested tags of the same name (<div><div></div></div>) cannot be paired correctly, because a regex cannot track nesting depth
  • HTML comments <!-- ... --> need a separate pattern
  • Script/style tag content can include < and confuse patterns
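
The first limitation can be reproduced with the Match Any Tag pattern. The sketch below also shows a quote-aware variant as one possible workaround; that variant (QUOTE_AWARE_TAG) is an assumption introduced here and still fails on malformed quoting.

```javascript
// Match Any Tag pattern from above, with the g flag.
const ANY_TAG = /<\/?[a-zA-Z][\w-]*(?:\s[^>]*)?\/?>/g;

// Assumed variant: treats quoted attribute values as units so a ">" inside
// quotes does not end the tag. Not one of the patterns listed above.
const QUOTE_AWARE_TAG = /<\/?[a-zA-Z][\w-]*(?:\s(?:"[^"]*"|'[^']*'|[^>"'])*)?\/?>/g;

const tricky = '<a title="a > b" href="/x">link</a>';

// The naive pattern stops at the ">" inside the title value.
console.log(tricky.match(ANY_TAG));
// [ '<a title="a >', '</a>' ]

// The quote-aware variant keeps the opening tag intact.
console.log(tricky.match(QUOTE_AWARE_TAG));
// [ '<a title="a > b" href="/x">', '</a>' ]
```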

For sanitization, use DOMPurify rather than regex.

Use Case

Extracting all `<a href>` URLs from rendered Markdown, finding `<img>` tags that lack an `alt` attribute during accessibility audits, or pulling the content of specific tags from email templates.
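
For the accessibility-audit case, a minimal sketch might look like the following. The function name imagesMissingAlt and the IMG_TAG pattern are assumptions for illustration; an empty alt="" is treated as present, since that is how decorative images are usually marked.

```javascript
// Assumed pattern for <img> tags; the lookahead avoids names like <img-card>.
const IMG_TAG = /<img(?=[\s/>])[^>]*>/gi;

// Attribute pattern from the Extract All Attributes section, with the g flag.
const IMG_ATTR = /(?<name>[\w-]+)(?:\s*=\s*(?:"(?<dq>[^"]*)"|'(?<sq>[^']*)'|(?<bare>[^\s>]+)))?/g;

// Hypothetical helper: returns the <img> tags that have no alt attribute at all.
function imagesMissingAlt(html) {
  return (html.match(IMG_TAG) ?? []).filter((tag) => {
    // Drop "<img" so the tag name is not counted as an attribute.
    const attrText = tag.replace(/^<img/i, "");
    const names = Array.from(attrText.matchAll(IMG_ATTR), (m) =>
      m.groups.name.toLowerCase()
    );
    return !names.includes("alt");
  });
}

console.log(
  imagesMissingAlt('<img src="a.png" alt="A"><img src="b.png"><img src="c.png" alt="">')
);
// [ '<img src="b.png">' ]
```

Checking attribute names rather than searching the tag for the substring alt avoids false negatives from values such as title="alt text".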
