Regex for Matching and Extracting HTML Tags

Regex patterns for matching HTML tags, extracting tag names and attributes, and stripping HTML. Includes important caveats about parsing HTML with regex.

Detailed Explanation

HTML Tag Matching with Regex

While regex should not be used as a full HTML parser, it is useful for simple tag matching, extraction, and sanitization tasks.

Match Any HTML Tag

<[^>]+>

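This matches any opening, closing, or self-closing HTML tag. It also matches comments and doctype declarations, because it accepts any run of characters between < and the next >: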

  • <div>, </div>, <br />, <img src="..." />
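As a minimal sketch of how this pattern might be used in JavaScript, the snippet below collects every tag from a string. The sample input is made up for illustration.

```javascript
// Sample input invented for this example.
const html = '<div class="box"><p>Hello<br />world</p></div>';

// The g flag makes match() return every tag instead of only the first one.
// match() returns null when nothing matches, hence the fallback to [].
const tags = html.match(/<[^>]+>/g) ?? [];

console.log(tags);
// [ '<div class="box">', '<p>', '<br />', '</p>', '</div>' ]
```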

Match Specific Tags

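<(?:p|div|span)(?=[\s/>])[^>]*>

Matches opening tags for specific elements. Use alternation inside a non-capturing group. The lookahead (?=[\s/>]) requires the tag name to end at that point; without it, <(?:p|div|span)[^>]*> would also match <pre> or <param>. Add the i flag to accept uppercase names such as <DIV>.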
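A short sketch of matching specific opening tags in JavaScript follows. The helper name findOpeningTags and the sample string are hypothetical. The pattern here includes a lookahead after the tag name so that names merely starting with p, such as pre, are not picked up.

```javascript
// The lookahead (?=[\s\/>]) requires whitespace, / or > right after the name,
// so <pre> and <param> are not mistaken for <p>. The i flag accepts <P> too.
const OPENING_TAG = /<(?:p|div|span)(?=[\s\/>])[^>]*>/gi;

// Hypothetical helper: returns all opening <p>, <div>, and <span> tags.
function findOpeningTags(html) {
  return html.match(OPENING_TAG) ?? [];
}

const sample = '<div id="a"><pre>code</pre><P class="x">text</P><span>s</span></div>';
console.log(findOpeningTags(sample));
// [ '<div id="a">', '<P class="x">', '<span>' ]
```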

Extract Tag Name and Attributes

<(?<tag>\w+)(?<attrs>[^>]*)>

Groups:

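  • tag: the element name
  • attrs: everything between the tag name and the closing >, as a raw string (including leading whitespace, and the trailing / of a self-closing tag)

Closing tags such as </div> are not matched, because \w+ cannot match the leading slash.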
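The named groups can be read through matchAll in JavaScript (ES2018 or later). This is only a sketch; listOpeningTags is a hypothetical helper, and it trims the raw attribute string for readability.

```javascript
const OPEN_TAG = /<(?<tag>\w+)(?<attrs>[^>]*)>/g;

// Hypothetical helper: returns one { tag, attrs } object per opening tag.
function listOpeningTags(html) {
  return Array.from(html.matchAll(OPEN_TAG), (m) => ({
    tag: m.groups.tag,
    attrs: m.groups.attrs.trim(), // drop the whitespace after the tag name
  }));
}

for (const { tag, attrs } of listOpeningTags('<a href="/home" class="nav">Home</a><br>')) {
  console.log(`${tag}: [${attrs}]`);
}
// a: [href="/home" class="nav"]
// br: []
```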

Extract Individual Attributes

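(?<attr>[\w:-]+)=(?:"(?<val>[^"]*)"|'(?<val2>[^']*)')

Handles both double-quoted and single-quoted attribute values; the value lands in val or val2 depending on the quote style. The name class [\w:-]+ accepts hyphenated and namespaced names such as data-id or xlink:href, which \w+ alone would truncate (data-id would be read as id). Unquoted values (width=100) and boolean attributes (disabled) are not matched.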
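One way to turn a raw attribute string into an object is sketched below. parseAttributes is a hypothetical helper, and the name class [\w:-]+ is an assumption chosen here so that hyphenated names such as data-id are captured whole.

```javascript
// [\w:-]+ (an assumption for this sketch) lets data-id and aria-label match whole.
const ATTR = /(?<attr>[\w:-]+)=(?:"(?<val>[^"]*)"|'(?<val2>[^']*)')/g;

// Hypothetical helper: maps attribute names to their quoted values.
function parseAttributes(attrString) {
  const result = {};
  for (const m of attrString.matchAll(ATTR)) {
    // Only one of val / val2 is defined, depending on the quote style used.
    result[m.groups.attr] = m.groups.val ?? m.groups.val2;
  }
  return result;
}

console.log(parseAttributes(`href="/home" title='Go home' data-id="42" disabled`));
// { href: '/home', title: 'Go home', 'data-id': '42' }
```

The boolean attribute disabled is skipped because it has no quoted value.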

Strip All HTML Tags

str.replace(/<[^>]+>/g, "")

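Removes all HTML tags, leaving only text content. Note: this does not handle all edge cases. A > inside an attribute value ends the match early, and the contents of <script> and <style> elements stay in the output as text.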
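Wrapped in a function, the replacement might look like the sketch below. stripTags is a hypothetical name, and the second call illustrates the kind of edge case the pattern gets wrong: a > inside an attribute value.

```javascript
// Hypothetical helper: removes anything that looks like a tag.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, "");
}

console.log(stripTags("<p>Hello <b>world</b>!</p>"));
// Hello world!

// Known limitation: the first > ends the match, even inside a quoted value,
// so part of the attribute leaks into the output.
console.log(JSON.stringify(stripTags('<a title="a > b">link</a>')));
// " b\">link"
```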

Match Self-Closing Tags

<\w+[^>]*/>

Matches tags like <br />, <img src="..." />, <input type="text" />. The slash-less HTML5 form of void elements (<br>, <img src="...">) is not matched.
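A quick sketch of the self-closing pattern in JavaScript, using a made-up sample string. Note that the img tag in the sample is written without a trailing slash and is therefore not matched.

```javascript
const SELF_CLOSING = /<\w+[^>]*\/>/g;

// Sample input invented for this example.
const markup = '<p>Name: <input type="text" /><br/><img src="a.png"></p>';

const selfClosing = markup.match(SELF_CLOSING) ?? [];
console.log(selfClosing);
// [ '<input type="text" />', '<br/>' ]
```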

Why Not Parse HTML with Regex

  • HTML is not a regular language; it has recursive nesting
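  • A regex cannot reliably pair an opening tag with its matching closing tag when elements of the same name are nested
  • Malformed or unusual markup (unclosed tags, > inside attribute values, comments) produces wrong matches without any error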
  • Use a DOM parser (DOMParser, cheerio, jsdom) for reliable HTML processing
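The nesting problem can be seen with a small experiment. The inputs below are made up; the point is that neither the lazy nor the greedy quantifier pairs tags correctly in general.

```javascript
// Nested elements: the lazy match stops at the first </div>,
// which belongs to the inner element, not the outer one.
const nested = "<div>outer <div>inner</div> tail</div>";
const lazy = nested.match(/<div>(.*?)<\/div>/)[1];
console.log(lazy);
// outer <div>inner

// Sibling elements: the greedy match runs to the last </div>
// and swallows both elements as if they were one.
const siblings = "<div>a</div><div>b</div>";
const greedy = siblings.match(/<div>(.*)<\/div>/)[1];
console.log(greedy);
// a</div><div>b
```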

For simple extraction or cleanup of known-format HTML, regex works fine. For complex parsing or untrusted input, always use a proper parser.

Use Case

You need to strip HTML tags from user input for plain-text preview, extract specific elements from a known HTML structure, or sanitize HTML by removing unwanted tags while preserving content.
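As an illustration of the last use case, the sketch below removes selected tags while keeping the text between them. removeTags is a hypothetical helper, the element names are assumed to be plain identifiers with no regex metacharacters, and this is a cleanup aid for known-format HTML rather than a security sanitizer.

```javascript
// Hypothetical helper: drops opening and closing tags for the listed element
// names and keeps their content. Names must be plain identifiers like "font".
function removeTags(html, names) {
  // </? matches both <font ...> and </font>; the lookahead stops "p" matching <pre>.
  const pattern = new RegExp(`</?(?:${names.join("|")})(?=[\\s/>])[^>]*>`, "gi");
  return html.replace(pattern, "");
}

const input = '<p><font color="red">Hi</font> <center>there</center></p>';
console.log(removeTags(input, ["font", "center"]));
// <p>Hi there</p>
```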
