Decode HTML Entities During Markdown Conversion
Understand how HTML entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;nbsp;, &amp;copy;) are decoded during HTML-to-Markdown conversion. Covers named entities, numeric entities, and special characters.
Detailed Explanation
HTML Entities to Markdown
HTML entities are special character sequences that represent characters which have special meaning in HTML or are not easily typed. During Markdown conversion, most entities must be decoded back to their literal characters.
Common Named Entities
<p>Tom &amp;amp; Jerry</p>
<p>Price: 5 &amp;lt; 10</p>
<p>&amp;copy; 2024 Company</p>
<p>Hello&amp;nbsp;World</p>
Converts to:
Tom & Jerry
Price: 5 < 10
(c) 2024 Company
Hello World
Key decoding rules:
&amp;amp; becomes &
&amp;lt; becomes <
&amp;gt; becomes >
&amp;quot; becomes "
&amp;apos; becomes '
&amp;nbsp; becomes a regular space (or a non-breaking space, depending on the converter)
&amp;copy; becomes (c) or the Unicode character ©
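The rules above can be sketched with Python's standard library, whose html.unescape function decodes named entities to their Unicode characters. The helper name decode_entities and its keep_nbsp flag are assumptions made for this illustration, not part of any particular converter:

```python
import html


def decode_entities(text: str, keep_nbsp: bool = False) -> str:
    """Decode HTML entities; optionally turn non-breaking spaces into plain spaces."""
    decoded = html.unescape(text)
    if not keep_nbsp:
        # U+00A0 is what &nbsp; decodes to; many converters normalize it.
        decoded = decoded.replace("\u00a0", " ")
    return decoded


samples = [
    "Tom &amp; Jerry",
    "Price: 5 &lt; 10",
    "&copy; 2024 Company",
    "Hello&nbsp;World",
]
for sample in samples:
    print(decode_entities(sample))
```

Note that html.unescape yields the Unicode character © for &amp;copy;. A converter that prefers the ASCII form (c) would need an extra replacement step after decoding.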
Numeric Entities
HTML also uses numeric character references:
<p>&amp;#169; Copyright</p>
<p>&amp;#8212; em dash</p>
<p>&amp;#x20AC; Euro sign</p>
Converts to:
(c) Copyright
— em dash
€ Euro sign
&#NNN; is decimal and &#xHHH; is hexadecimal. Both are decoded to the corresponding Unicode character.
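To make the two numeric forms concrete, here is a minimal sketch that decodes them by hand: the digits are parsed as base 10 or base 16 and the resulting code point is passed to chr. The function name decode_numeric_refs is hypothetical, and in practice html.unescape already handles these references (including edge cases this sketch ignores):

```python
import re

# Group 1 captures hexadecimal digits (&#xHHH;), group 2 decimal digits (&#NNN;).
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace decimal and hexadecimal character references with their characters."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Outside the Unicode range: leave the reference untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


print(decode_numeric_refs("&#169; Copyright"))
print(decode_numeric_refs("&#x20AC; Euro sign"))
```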
Entities Inside Code
Inside Markdown code spans and code blocks, entities should be decoded to literal characters since Markdown code is not processed:
<code>&amp;lt;div&amp;gt;</code>
Converts to:
`<div>`
The entity &amp;lt; must be decoded to < because Markdown code blocks display content literally without HTML interpretation.
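A rough sketch of this step, assuming simple attribute-free <code> elements matched with a regular expression (a real converter would use an HTML parser). The function name code_to_markdown is hypothetical. The backtick fence is lengthened when the decoded content itself contains backticks, so the code span stays intact:

```python
import html
import re

# Simplification: only matches <code> tags without attributes.
_CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def code_to_markdown(fragment: str) -> str:
    """Convert <code>...</code> to a Markdown code span with entities decoded."""

    def _replace(match):
        literal = html.unescape(match.group(1))
        # Use a fence one backtick longer than the longest run inside the content.
        longest = max((len(run) for run in re.findall(r"`+", literal)), default=0)
        fence = "`" * (longest + 1)
        # Pad with a space if the content starts or ends with a backtick.
        pad = " " if literal.startswith("`") or literal.endswith("`") else ""
        return f"{fence}{pad}{literal}{pad}{fence}"

    return _CODE_TAG.sub(_replace, fragment)


print(code_to_markdown("<code>&lt;div&gt;</code>"))
```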
Entities in Attributes
Entities can appear in attribute values too:
<a href="page.html?a=1&amp;amp;b=2">Link</a>
The &amp;amp; in the URL must be decoded to & to produce a working link:
[Link](page.html?a=1&b=2)
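The link example can be sketched as follows, assuming an anchor whose only attribute is a double-quoted href. The regular expression and the function name link_to_markdown are illustrative simplifications, not how any specific converter works. Both the attribute value and the link text are decoded:

```python
import html
import re

# Simplification: matches only <a href="...">text</a> with no other attributes.
_LINK = re.compile(r'<a\s+href="([^"]*)">(.*?)</a>', re.DOTALL)


def link_to_markdown(fragment: str) -> str:
    """Convert a simple HTML anchor to a Markdown link, decoding entities."""

    def _replace(match):
        url = html.unescape(match.group(1))   # &amp; in the href becomes &
        text = html.unescape(match.group(2))  # link text is decoded too
        return f"[{text}]({url})"

    return _LINK.sub(_replace, fragment)


print(link_to_markdown('<a href="page.html?a=1&amp;b=2">Link</a>'))
```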
Characters That Need Escaping in Markdown
After decoding entities, some characters may need to be escaped in Markdown to prevent them from being interpreted as syntax. Examples include *, _, [, and ] anywhere in the text, and # at the start of a line.
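One way to order the two steps is to decode first and escape second, so that characters produced by decoding are escaped as well. The sketch below is deliberately conservative and escapes every occurrence of the inline characters, which may over-escape in some contexts (for example, underscores inside words). The function name decode_then_escape and the exact character set are assumptions for illustration:

```python
import html
import re

# Inline characters escaped everywhere; the backslash itself is included
# so that existing backslashes are not mistaken for escapes.
_INLINE_SPECIALS = re.compile(r"([\\*_\[\]`])")
# One to six # characters at the start of a line (after up to three spaces),
# followed by whitespace or end of line, would start a heading.
_LEADING_HASH = re.compile(r"^([ ]{0,3})(#{1,6})(?=\s|$)", re.MULTILINE)


def decode_then_escape(text: str) -> str:
    """Decode HTML entities, then backslash-escape Markdown syntax characters."""
    decoded = html.unescape(text)
    escaped = _INLINE_SPECIALS.sub(r"\\\1", decoded)
    return _LEADING_HASH.sub(r"\1\\\2", escaped)


print(decode_then_escape("&#42;not bold&#42;"))
print(decode_then_escape("&#35; not a heading"))
```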
Use Case
Entity decoding is a fundamental step in any HTML-to-Markdown converter. It is especially important when processing content from older websites, XML-based content management systems, or any HTML that was programmatically generated with heavy entity encoding.