Decode HTML Entities During Markdown Conversion

Understand how HTML entities (&amp;, &lt;, &gt;, &nbsp;, &copy;) are decoded during HTML-to-Markdown conversion. Covers named entities, numeric entities, and special characters.

Detailed Explanation

HTML Entities to Markdown

HTML entities are special character sequences that represent characters which have special meaning in HTML or are not easily typed. During Markdown conversion, most entities must be decoded back to their literal characters.

Common Named Entities

<p>Tom &amp; Jerry</p>
<p>Price: 5 &lt; 10</p>
<p>&copy; 2024 Company</p>
<p>Hello&nbsp;&nbsp;&nbsp;World</p>

Converts to:

Tom & Jerry

Price: 5 < 10

© 2024 Company

Hello   World

Key decoding rules:

  • &amp; becomes &
  • &lt; becomes <
  • &gt; becomes >
  • &quot; becomes "
  • &apos; becomes '
  • &nbsp; becomes a regular space (or a literal non-breaking space, U+00A0, depending on the converter)
  • &copy; becomes the Unicode character © (some converters substitute the ASCII fallback (c))
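
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular converter works: it relies on the standard library's html.unescape, and the decode_entities helper and its keep_nbsp flag are hypothetical names introduced here to show the two &nbsp; policies.

```python
import html


def decode_entities(text: str, keep_nbsp: bool = False) -> str:
    """Decode named and numeric HTML entities in a text fragment.

    Hypothetical helper: keep_nbsp chooses between the two &nbsp;
    policies (keep U+00A0, or fall back to a regular space).
    """
    decoded = html.unescape(text)
    if not keep_nbsp:
        # html.unescape turns &nbsp; into U+00A0; swap it for a plain space.
        decoded = decoded.replace("\u00a0", " ")
    return decoded


samples = [
    "Tom &amp; Jerry",
    "Price: 5 &lt; 10",
    "&copy; 2024 Company",
    "Hello&nbsp;&nbsp;&nbsp;World",
]
for sample in samples:
    print(decode_entities(sample))
```

Note that runs of regular spaces usually collapse to one space when the Markdown is rendered, so a converter that needs to preserve the spacing may prefer keep_nbsp=True.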

Numeric Entities

HTML also uses numeric character references:

<p>&#169; Copyright</p>
<p>&#x2014; em dash</p>
<p>&#8364; Euro sign</p>

Converts to:

© Copyright

— em dash

€ Euro sign

&#NNN; is decimal and &#xHHH; is hexadecimal. Both are decoded to the corresponding Unicode character.
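
To make the decimal and hexadecimal forms concrete, here is a small hand-rolled decoder. It is only a sketch: decode_numeric_refs is a hypothetical name, and leaving out-of-range references untouched is an assumption of this example rather than a rule from any specification. In practice Python's html.unescape already handles numeric references.

```python
import re

# &#NNN; (decimal) or &#xHHH; (hexadecimal, "x" in either case).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with their Unicode characters."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the reference as written.
            return match.group(0)

    return NUMERIC_REF.sub(replace, text)


for sample in ("&#169; Copyright", "&#x2014; em dash", "&#8364; Euro sign"):
    print(decode_numeric_refs(sample))
```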

Entities Inside Code

Inside Markdown code spans and code blocks, entities should be decoded to literal characters, because Markdown renders code content verbatim and would otherwise show the raw entity text:

<code>&lt;div&gt;</code>

Converts to:

`<div>`

The entity &lt; must be decoded to < because Markdown code spans and code blocks display their content literally, without HTML interpretation. Left undecoded, the reader would see &lt;div&gt; instead of <div>.
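
A rough sketch of this step in Python follows. It assumes simple, attribute-free, non-nested <code> elements and uses a regular expression only for brevity; a real converter would walk a parsed DOM. The code_to_markdown name is hypothetical. The delimiter handling (a longer backtick run when the content itself contains backticks) follows the CommonMark rule for code spans.

```python
import html
import re

# Assumption: plain <code>...</code> with no attributes and no nesting.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def code_to_markdown(fragment: str) -> str:
    """Convert <code> elements to Markdown code spans, decoding entities."""

    def replace(match):
        literal = html.unescape(match.group(1))
        # Pick a delimiter one backtick longer than the longest run inside.
        runs = re.findall(r"`+", literal)
        fence = "`" * (max((len(run) for run in runs), default=0) + 1)
        # Pad with a space when the content starts or ends with a backtick.
        pad = " " if literal.startswith("`") or literal.endswith("`") else ""
        return f"{fence}{pad}{literal}{pad}{fence}"

    return CODE_TAG.sub(replace, fragment)


print(code_to_markdown("<code>&lt;div&gt;</code>"))
```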

Entities in Attributes

Entities can appear in attribute values too:

<a href="page.html?a=1&amp;b=2">Link</a>

The &amp; in the URL must be decoded to & to produce a working link:

[Link](page.html?a=1&b=2)
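
As an illustration, Python's html.parser.HTMLParser decodes entities in attribute values before passing them to handle_starttag, so a link converter built on it gets the working URL without extra effort. The LinkConverter class and links_to_markdown function below are hypothetical names, and the sketch handles only <a> elements and plain text.

```python
from html.parser import HTMLParser


class LinkConverter(HTMLParser):
    """Turn <a href="..."> elements into Markdown links (sketch only)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Attribute values arrive already entity-decoded.
            self._href = dict(attrs).get("href") or ""
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def links_to_markdown(fragment: str) -> str:
    parser = LinkConverter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


print(links_to_markdown('<a href="page.html?a=1&amp;b=2">Link</a>'))
```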

Characters That Need Escaping in Markdown

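After decoding entities, some characters may need to be escaped with a backslash so that Markdown does not interpret them as syntax. Examples include *, _, [, and ] in running text, and # at the start of a line. This matters because an entity such as &#42; or &#35; decodes to exactly such a character.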
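
The decode-then-escape order can be sketched as follows. The function names are hypothetical, and the set of escaped characters is deliberately small: it covers the characters listed above plus the backslash itself, which is an addition made here so that existing backslashes are not misread as escapes. A production converter would usually escape only where the character would actually trigger syntax.

```python
import html
import re

# Characters escaped anywhere in running text (backslash is an addition).
INLINE_SPECIALS = re.compile(r"([\\*_\[\]])")
# One to six "#" at the start of a line, followed by whitespace or line end.
LINE_START_HASH = re.compile(r"^([ ]{0,3})(#{1,6})(?=[ \t]|$)", re.MULTILINE)


def escape_markdown(text: str) -> str:
    """Backslash-escape characters that Markdown would treat as syntax."""
    escaped = INLINE_SPECIALS.sub(r"\\\1", text)
    return LINE_START_HASH.sub(r"\1\\\2", escaped)


def entity_text_to_markdown(text: str) -> str:
    """Decode entities first, then escape the resulting literal text."""
    return escape_markdown(html.unescape(text))


print(entity_text_to_markdown("&#35; not a heading"))
print(entity_text_to_markdown("2 &#42; 3 and snake_case"))
```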

Use Case

Entity decoding is a fundamental step in any HTML-to-Markdown converter. It is especially important when processing content from older websites, XML-based content management systems, or any HTML that was programmatically generated with heavy entity encoding.
