Convert Scraped Web HTML to Structured Markdown

Convert raw HTML from web scraping into well-structured Markdown: filter out noisy markup, advertisements, and navigation elements, then extract the main content for clean output.

Detailed Explanation

Web Scraping HTML to Markdown

Web scraping produces raw HTML that includes navigation menus, advertisements, footer content, script tags, and other elements that should not appear in the Markdown output. Converting scraped HTML requires filtering and content extraction.

Removing Non-Content Elements

Before conversion, strip elements that do not contribute to the main content:

<html>
  <head>
    <title>Article Title</title>
    <style>.ad { display: block; }</style>
    <script>analytics.track("page");</script>
  </head>
  <body>
    <nav>
      <a href="/">Home</a>
      <a href="/about">About</a>
    </nav>
    <main>
      <article>
        <h1>Article Title</h1>
        <p>The main content of the article.</p>
      </article>
    </main>
    <aside class="ad">
      <p>Advertisement content</p>
    </aside>
    <footer>
      <p>&copy; 2024 Example</p>
    </footer>
  </body>
</html>

For this page, the converter should extract and convert only the <main> or <article> content:

# Article Title

The main content of the article.

Elements to Strip

  • <script> and <noscript> — JavaScript code
  • <style> — CSS rules
  • <nav> — navigation menus
  • <header> and <footer> — site-level headers and footers
  • <aside> — sidebars and advertisements
  • <iframe> — embedded frames (unless specifically needed)
  • Hidden elements — display: none, visibility: hidden, aria-hidden="true"
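
The stripping step can be sketched with Python's standard-library html.parser. This is a minimal illustration under stated assumptions, not a full converter: the names STRIP_TAGS, StrippingConverter, and html_to_markdown are hypothetical, only headings and paragraphs are converted, and hidden-element detection (display: none, aria-hidden) is left out.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is dropped (taken from the list above).
STRIP_TAGS = {"script", "noscript", "style", "nav",
              "header", "footer", "aside", "iframe"}
# Markdown prefix for each heading tag: h1 -> "#", h2 -> "##", ...
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}


class StrippingConverter(HTMLParser):
    """Hypothetical minimal converter: headings and paragraphs only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a stripped subtree
        self.prefix = None    # Markdown prefix of the open block, if any
        self.buffer = []      # text pieces of the open block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag in HEADINGS or tag == "p"):
            self.prefix = HEADINGS.get(tag, "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif (self.skip_depth == 0 and self.prefix is not None
              and (tag in HEADINGS or tag == "p")):
            # Collapse runs of whitespace left over from HTML indentation.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(f"{self.prefix} {text}".strip())
            self.prefix = None

    def handle_data(self, data):
        # Text is kept only inside a heading or paragraph that is not stripped.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = StrippingConverter()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Applied to the sample page above, this sketch returns the heading and the single paragraph of the article and nothing else. Note that it drops every <header>, including one nested inside <article>, which a real pipeline may want to keep.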

Content Extraction Strategies

  1. Target the article element — look for <article>, <main>, or a <div> with a content-like class (content, post-body, entry-content)
  2. Use readability algorithms — libraries like Mozilla Readability score elements by content density
  3. Manual CSS selectors — specify which elements to include or exclude
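
The first strategy can be sketched as a fallback chain: try <article>, then <main>, then a <div> with a content-like class, and return the input unchanged if nothing matches. The names SubtreeCapture and extract_main_content are hypothetical, and the sketch assumes balanced tags (pages that omit closing tags such as </p> would need a real HTML5 parser). It does not attempt readability scoring.

```python
from html import escape
from html.parser import HTMLParser

CONTENT_CLASSES = {"content", "post-body", "entry-content"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SubtreeCapture(HTMLParser):
    """Captures the inner HTML of the first element accepted by `matches`."""

    def __init__(self, matches):
        super().__init__(convert_charrefs=True)
        self.matches = matches
        self.depth = 0      # nesting depth inside the captured element
        self.done = False   # True once the captured element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Re-emit the tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag not in VOID_TAGS and self.matches(tag, dict(attrs)):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            # Character references were decoded by the parser; re-escape them.
            self.parts.append(escape(data, quote=False))


def extract_main_content(html: str) -> str:
    """Try <article>, then <main>, then a content-like <div>."""
    strategies = [
        lambda tag, attrs: tag == "article",
        lambda tag, attrs: tag == "main",
        lambda tag, attrs: tag == "div"
        and bool(CONTENT_CLASSES & set((attrs.get("class") or "").split())),
    ]
    for matches in strategies:
        finder = SubtreeCapture(matches)
        finder.feed(html)
        finder.close()
        if finder.depth or finder.done:
            return "".join(finder.parts).strip()
    return html  # no container found: leave the decision to the caller
```

Ordering the strategies from most to least specific keeps the result predictable when a page contains both <main> and <article>.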

Cleaning Up the Output

After initial conversion, post-processing may be needed:

  • Remove duplicate blank lines
  • Resolve relative links and image URLs to absolute URLs
  • Remove empty headings or paragraphs
  • Normalize heading levels (if the article starts with h2, consider promoting every heading by one level so the top heading becomes h1)
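
These post-processing steps can be sketched with regular expressions and urllib.parse.urljoin. The function name clean_markdown and its base_url parameter are hypothetical. The sketch assumes inline-style links only and treats any line that starts with # as a heading, so "#" comment lines inside code blocks would need extra handling.

```python
import re
from urllib.parse import urljoin


def clean_markdown(markdown: str, base_url: str) -> str:
    """Apply the clean-up steps listed above to converted Markdown."""
    # 1. Resolve link and image targets against the page URL.
    #    Absolute URLs pass through urljoin unchanged.
    text = re.sub(
        r"\]\(([^)\s]+)",
        lambda m: "](" + urljoin(base_url, m.group(1)),
        markdown,
    )

    # 2. Remove headings that have no text.
    text = re.sub(r"^#{1,6}[ \t]*$\n?", "", text, flags=re.MULTILINE)

    # 3. Promote headings so that the shallowest one becomes level 1.
    levels = [len(m.group(1))
              for m in re.finditer(r"^(#{1,6})[ \t]+\S", text, flags=re.MULTILINE)]
    if levels and min(levels) > 1:
        shift = min(levels) - 1
        text = re.sub(
            r"^(#{1,6})(?=[ \t]+\S)",
            lambda m: "#" * (len(m.group(1)) - shift),
            text,
            flags=re.MULTILINE,
        )

    # 4. Empty whitespace-only lines, then collapse runs of blank lines.
    text = re.sub(r"^[ \t]+$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Link resolution runs first so that later steps see the final text. Trailing spaces on non-empty lines are left alone because two trailing spaces are a hard line break in Markdown.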

Handling Dynamic Content

JavaScript-rendered pages (single-page applications such as React apps) must be rendered in a headless browser before conversion, because the HTML returned by a plain HTTP request may be incomplete or empty.
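
One possible way to decide when a headless browser is worth the cost is to measure how much visible text the plain HTTP response contains. The sketch below is a heuristic only: the names VisibleTextCounter and needs_rendering are hypothetical, and the 200-character threshold is an arbitrary assumption that would need tuning per site.

```python
from html.parser import HTMLParser

# Elements whose text is never shown to a reader of the rendered page body.
INVISIBLE_TAGS = {"script", "style", "noscript", "template", "title"}


class VisibleTextCounter(HTMLParser):
    """Counts non-whitespace characters outside invisible elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chars += len("".join(data.split()))


def needs_rendering(html: str, min_chars: int = 200) -> bool:
    """Guess whether the page is an empty shell that JavaScript fills in."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.chars < min_chars
```

A scraper could call needs_rendering on the plain response and fall back to a headless browser only when it returns True. Short but fully static pages will also trip the threshold, so the result is a hint rather than a verdict.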

Batch Processing

When scraping multiple pages, establish consistent conversion rules:

  • Same content selectors across the site
  • Uniform heading level normalization
  • Consistent image URL resolution
  • Template for front matter (title, date, author, tags)
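
A shared rule set and a front matter template might look like the sketch below. SiteRules, front_matter, and build_document are hypothetical names, and the field values in the usage example are placeholders. Values are serialized with json.dumps on the assumption that the static site generator reads YAML front matter, which accepts JSON-style quoted strings and lists.

```python
import json
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class SiteRules:
    """Hypothetical per-site settings applied to every scraped page."""
    base_url: str
    content_selectors: tuple = ("article", "main")
    top_heading_level: int = 1

    def resolve(self, url: str) -> str:
        # The same resolution rule for every image and link on the site.
        return urljoin(self.base_url, url)


def front_matter(title: str, date: str, author: str, tags: list) -> str:
    """Render a YAML-style front matter block with safely quoted values."""
    fields = {"title": title, "date": date, "author": author, "tags": tags}
    lines = [f"{key}: {json.dumps(value, ensure_ascii=False)}"
             for key, value in fields.items()]
    return "---\n" + "\n".join(lines) + "\n---\n"


def build_document(meta: dict, body_markdown: str) -> str:
    """Prepend front matter to the converted Markdown body."""
    return front_matter(**meta) + "\n" + body_markdown.strip() + "\n"


# Usage with placeholder values:
rules = SiteRules(base_url="https://example.com/blog/")
meta = {"title": "Article Title", "date": "2024-01-15",
        "author": "Jane Doe", "tags": ["web", "scraping"]}
document = build_document(meta, "# Article Title\n\nThe main content of the article.")
```

Keeping the rules in one frozen object makes it harder for individual pages to drift from the site-wide conventions.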

Use Case

Web scraping to Markdown is used for content aggregation, building knowledge bases, archiving web content, migrating entire websites to static site generators, and creating offline-readable documentation from online resources.
