Regex to Extract Hashtags from Text
Regex patterns for extracting #hashtags from social media posts, blog content, and notes. Covers an ASCII pattern, a Unicode pattern (Japanese, emoji), and a variant that rejects numeric-only tags; all of them allow underscores.
Detailed Explanation
Extracting Hashtags
Hashtags appear in social posts, note-taking apps, and blog content. The basic shape is # followed by a sequence of letters and digits, but rules vary across platforms.
ASCII Hashtags
(?:^|\s)(#[A-Za-z0-9_]+)
The leading non-capturing group requires the # to sit at the start of the text or directly after whitespace, so color#abc (where #abc is a CSS color value) is not matched. The hashtag itself lands in capture group 1.
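As a minimal sketch of the ASCII pattern in use, the helper below collects capture group 1 from every match. The function name `extractAsciiHashtags` and the sample string are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: returns the ASCII hashtags found in `text`.
function extractAsciiHashtags(text) {
  // The non-capturing prefix anchors the tag to the start of the text
  // or to preceding whitespace; group 1 holds the tag itself.
  const pattern = /(?:^|\s)(#[A-Za-z0-9_]+)/g;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const asciiSample = "Loving #JavaScript today, color#abc stays out, #and-here stops early";
console.log(extractAsciiHashtags(asciiSample));
// → [ '#JavaScript', '#and' ]
```

The hyphen is outside the character class, so #and-here is cut at #and, and the # inside color#abc is skipped because no whitespace precedes it.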
Unicode Hashtags (Japanese, Emoji-Friendly)
(?:^|\s)(#[\p{L}\p{N}_\p{Extended_Pictographic}]+)
In JavaScript, the \p{…} property escapes require the u flag. The pattern matches #プログラミング and hashtags containing emoji.
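The Unicode pattern can be exercised the same way. This is a sketch under the assumption of a runtime that supports Unicode property escapes, including `Extended_Pictographic`; the helper name `extractUnicodeHashtags` is hypothetical.

```javascript
// Hypothetical helper: returns hashtags made of Unicode letters, numbers,
// underscores, and pictographic symbols (emoji).
function extractUnicodeHashtags(text) {
  // The u flag is what enables the \p{...} property escapes.
  const pattern = /(?:^|\s)(#[\p{L}\p{N}_\p{Extended_Pictographic}]+)/gu;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(extractUnicodeHashtags("勉強中 #プログラミング and #launch🚀 today"));
```

One caveat worth checking against real data: emoji written with a variation selector (U+FE0F) or a zero-width joiner fall partly outside this character class, so such a tag would be cut off at that code point.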
Tested Examples
| Input | ASCII | Unicode |
|---|---|---|
| "Loving #JavaScript today" | #JavaScript | #JavaScript |
| "Multiple #tags #here #and-here" | #tags, #here, #and | same |
| "#【news】 not a tag because of bracket" | no match | no match |
| "#プログラミング" | no match | #プログラミング |
| "order #1234" | #1234 (digits-only tags are excluded only by the numeric-only variant) | #1234 |
Reject Numeric-Only Tags
Some platforms ignore #1234. Add a lookahead requiring at least one letter:
(?:^|\s)(#(?=\w*[A-Za-z])[A-Za-z0-9_]+)
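A short sketch of the lookahead variant, using a hypothetical helper name `extractNonNumericHashtags`. Because `\w` does not match whitespace, the lookahead cannot borrow a letter from a later word.

```javascript
// Hypothetical helper: ASCII hashtags that contain at least one letter.
function extractNonNumericHashtags(text) {
  // (?=\w*[A-Za-z]) peeks ahead from the "#" and requires a letter
  // somewhere in the same run of word characters.
  const pattern = /(?:^|\s)(#(?=\w*[A-Za-z])[A-Za-z0-9_]+)/g;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(extractNonNumericHashtags("order #1234 shipped #2024goals #v2"));
// → [ '#2024goals', '#v2' ]
```

Note that a tag made only of digits and underscores, such as #12_34, is also rejected, since it contains no letter.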
JavaScript Extraction
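// Group 1 holds the hashtag; the u flag enables the \p{…} escapes.
const text = "Loving #JavaScript today";
const tags = [...text.matchAll(/(?:^|\s)(#[\p{L}\p{N}_]+)/gu)]
  .map(m => m[1]);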
Counting Hashtag Frequency
const counts = tags.reduce((acc, t) => (acc[t] = (acc[t] ?? 0) + 1, acc), {});
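If #JS and #js should count as the same tag, one possible variation is to lowercase each tag before counting and to return the result sorted by frequency. This is a sketch, and whether case folding is appropriate depends on the platform being analyzed; `countHashtags` is a hypothetical name.

```javascript
// Hypothetical helper: case-insensitive hashtag counts, most frequent first.
function countHashtags(tags) {
  const counts = new Map();
  for (const tag of tags) {
    const key = tag.toLowerCase(); // assumption: tags differing only in case are equal
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Sort by descending count; ties keep first-seen order (sort is stable).
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countHashtags(["#JS", "#js", "#regex", "#Js", "#regex", "#unicode"]));
// → [ [ '#js', 3 ], [ '#regex', 2 ], [ '#unicode', 1 ] ]
```

A Map is used here instead of a plain object so the result keeps insertion order explicitly and can be sorted without extra conversion.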
Practical Notes
Match Twitter’s rules carefully if you are mirroring its behavior: hashtags can contain letters, digits, and underscores, but cannot be entirely numeric, and the maximum length is platform-defined.
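These rules can be collected into a single validator, sketched below. The name `isValidHashtag` and the default limit `ASSUMED_MAX_TAG_LENGTH` are placeholders chosen for illustration; the real maximum should come from the platform's own documentation.

```javascript
// Hypothetical limit: the real maximum depends on the platform.
const ASSUMED_MAX_TAG_LENGTH = 100;

// Hypothetical validator for a single candidate tag such as "#regex".
function isValidHashtag(tag, maxLength = ASSUMED_MAX_TAG_LENGTH) {
  // Must be "#" followed only by letters, digits, or underscores.
  if (!/^#[\p{L}\p{N}_]+$/u.test(tag)) return false;
  const body = tag.slice(1);
  // Require at least one letter, which rules out numeric-only tags.
  if (!/\p{L}/u.test(body)) return false;
  // Assumption: length is measured in code points, not UTF-16 units.
  return [...body].length <= maxLength;
}

console.log(isValidHashtag("#2024goals")); // true
console.log(isValidHashtag("#1234"));      // false
console.log(isValidHashtag("#abcdef", 3)); // false
```

Validating after extraction keeps the extraction regex simple and puts the platform-specific rules in one place.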
Use Case
Extracting topical tags from blog post bodies for tag-cloud generation, analyzing social media exports for trending hashtags, or auto-suggesting tags based on note content.