html-to-markdown

High-performance HTML → Markdown conversion powered by Rust. Ships as a Rust crate, Python package, PHP extension, Ruby gem, Elixir Rustler NIF, Node.js bindings, WebAssembly, and a standalone CLI, all with identical rendering behaviour.

🎮 Try the Live Demo →

Experience WebAssembly-powered HTML to Markdown conversion instantly in your browser. No installation needed!

Why html-to-markdown?

Blazing Fast: Rust-powered core delivers 10-80× faster conversion than pure Python alternatives
Universal: Works everywhere - Node.js, Bun, Deno, browsers, Python, Rust, and standalone CLI
Smart Conversion: Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
Metadata Extraction: Extract document metadata (title, description, headers, links, images) alongside conversion
Highly Configurable: Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
Tag Preservation: Keep specific HTML tags unconverted when markdown isn't expressive enough
Secure by Default: Built-in HTML sanitization prevents malicious content
Consistent Output: Identical markdown rendering across all language bindings

Documentation

Language Guides & API References:

Python – README with metadata extraction, inline images, hOCR workflows
JavaScript/TypeScript – Node.js | TypeScript | WASM
Ruby – README with RBS types, Steep type checking
PHP – Package | Extension (PIE)
Go – README with FFI bindings
Java – README with Panama FFI, Maven/Gradle setup
C#/.NET – README with NuGet distribution
Elixir – README with Rustler NIF bindings
Rust – README with core API, error handling, advanced features

Project Resources:

Contributing – CONTRIBUTING.md ⭐ Start here for development
Changelog – CHANGELOG.md – Version history and breaking changes

Installation

Target	Command(s)
Node.js/Bun (native)	`npm install html-to-markdown-node`
WebAssembly (universal)	`npm install html-to-markdown-wasm`
Deno	`import { convert } from "npm:html-to-markdown-wasm"`
Python (bindings + CLI)	`pip install html-to-markdown`
PHP (extension + helpers)	`PHP_EXTENSION_DIR=$(php-config --extension-dir) pie install goldziher/html-to-markdown` `composer require goldziher/html-to-markdown`
Ruby gem	`bundle add html-to-markdown` or `gem install html-to-markdown`
Elixir (Rustler NIF)	`{:html_to_markdown, "~> 2.8"}`
Rust crate	`cargo add html-to-markdown-rs`
Rust CLI (crates.io)	`cargo install html-to-markdown-cli`
Homebrew CLI	`brew install html-to-markdown` (core)
Releases	GitHub Releases

Quick Start

JavaScript/TypeScript

Node.js / Bun (Native - Fastest):

import { convert } from 'html-to-markdown-node';

const html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>';
const markdown = convert(html, {
  headingStyle: 'Atx',
  codeBlockStyle: 'Backticks',
  wrap: true,
  preserveTags: ['table'], // NEW in v2.5: Keep complex HTML as-is
});

Deno / Browsers / Edge (Universal):

import { convert } from "npm:html-to-markdown-wasm"; // Deno
// or: import { convert } from 'html-to-markdown-wasm'; // Bundlers

const html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>';
const markdown = convert(html, {
  headingStyle: 'atx',
  listIndentWidth: 2,
});

Performance: On the 129 KB Wikipedia "Lists" page, the shared fixture harness (task bench:bindings) now clocks C# at ~1.4k ops/sec (≈171 MB/s), Go at ~1.3k ops/sec (≈165 MB/s), and Node, Python, and the Rust CLI at ~1.3–1.4k ops/sec (≈150 MB/s), thanks to the new Buffer/Uint8Array fast paths and release-mode harness. Ruby stays close at ~1.2k ops/sec (≈150 MB/s), Java lands at ~1.0k ops/sec (≈126 MB/s), WASM reaches ~0.85k ops/sec (≈108 MB/s), and PHP achieves ~0.3k ops/sec (≈35 MB/s).

See the JavaScript guides listed under Documentation for full API documentation.

Metadata extraction (all languages)

import { convertWithMetadata } from 'html-to-markdown-node';

const html = `
  <html>
    <head>
      <title>Example</title>
      <meta name="description" content="Demo page">
      <link rel="canonical" href="https://example.com/page">
    </head>
    <body>
      <h1 id="welcome">Welcome</h1>
      <a href="https://example.com" rel="nofollow external">Example link</a>
      <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
    </body>
  </html>
`;

const { markdown, metadata } = await convertWithMetadata(
  html,
  { headingStyle: 'Atx' },
  { extract_links: true, extract_images: true, extract_headers: true },
);

console.log(markdown);
// metadata.document.title → 'Example'
// metadata.links[0].rel → ['nofollow', 'external']
// metadata.images[0].dimensions → [640, 480]

Equivalent APIs are available in every binding:

Python: convert_with_metadata(html, options=None, metadata_config=None)
Ruby: HtmlToMarkdown.convert_with_metadata(html, options = nil, metadata_config = nil)
PHP: convert_with_metadata(string $html, ?array $options = null, ?array $metadataConfig = null)

CLI

# Convert a file
html-to-markdown input.html > output.md

# Stream from stdin
curl https://example.com | html-to-markdown > output.md

# Apply options
html-to-markdown --heading-style atx --list-indent-width 2 input.html

# Fetch a remote page (HTTP) with optional custom User-Agent
html-to-markdown --url https://example.com > output.md
html-to-markdown --url https://example.com --user-agent "Mozilla/5.0" > output.md
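
For scripted use, the commands above can also be driven from another program. The following is a minimal Python sketch, assuming the `html-to-markdown` binary is installed and on `PATH`. The helper names `build_args` and `convert_via_cli` are illustrative and not part of the project; only flags shown in the examples above are used.

```python
import subprocess
from typing import List, Optional


def build_args(
    input_path: Optional[str] = None,
    url: Optional[str] = None,
    heading_style: Optional[str] = None,
    list_indent_width: Optional[int] = None,
    user_agent: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list using only the flags shown in the CLI examples."""
    args = ["html-to-markdown"]
    if heading_style is not None:
        args += ["--heading-style", heading_style]
    if list_indent_width is not None:
        args += ["--list-indent-width", str(list_indent_width)]
    if url is not None:
        args += ["--url", url]
        if user_agent is not None:
            args += ["--user-agent", user_agent]
    if input_path is not None:
        args.append(input_path)
    return args


def convert_via_cli(html: str, **options) -> str:
    """Pipe HTML to the CLI on stdin and return the Markdown from stdout.

    Assumes the binary is on PATH; raises CalledProcessError if the CLI
    exits with a non-zero status.
    """
    result = subprocess.run(
        build_args(**options),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Building the argument list does not require the binary to be installed.
print(build_args(input_path="input.html", heading_style="atx", list_indent_width=2))
```

Passing the arguments as a list avoids shell quoting problems with values such as a custom User-Agent string.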

Metadata Extraction

Extract document metadata alongside HTML-to-Markdown conversion. All bindings support identical APIs:

CLI Examples

# Basic metadata extraction with conversion
html-to-markdown input.html --with-metadata -o output.json

# Extract document metadata (title, description, language, etc.)
html-to-markdown input.html --with-metadata --extract-document

# Extract headers and links
html-to-markdown input.html --with-metadata --extract-headers --extract-links

# Extract all metadata types with conversion
html-to-markdown input.html --with-metadata \
  --extract-document \
  --extract-headers \
  --extract-links \
  --extract-images \
  --extract-structured-data \
  -o metadata.json

# Fetch and extract from remote URL
html-to-markdown --url https://example.com --with-metadata -o output.json

# Web scraping with preprocessing and metadata
html-to-markdown page.html --preprocess --preset aggressive \
  --with-metadata --extract-links --extract-images

Output format (JSON):

{
  "markdown": "# Title\n\nContent here...",
  "metadata": {
    "document": {
      "title": "Page Title",
      "description": "Meta description",
      "charset": "utf-8",
      "language": "en"
    },
    "headers": [
      { "level": 1, "text": "Title", "id": "title" }
    ],
    "links": [
      {
        "text": "Example",
        "href": "https://example.com",
        "title": null,
        "rel": ["external"]
      }
    ],
    "images": [
      {
        "src": "https://example.com/image.jpg",
        "alt": "Hero image",
        "title": null,
        "dimensions": [640, 480]
      }
    ]
  }
}
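
A short sketch of consuming this JSON from a script may help. It uses only the Python standard library and the sample payload shown above. The `summarize` helper is hypothetical, and treating every metadata section as optional (depending on which `--extract-*` flags were passed) is an assumption rather than documented behaviour.

```python
import json

# Sample payload copied from the output format above (raw string keeps \n escaped).
sample = r"""
{
  "markdown": "# Title\n\nContent here...",
  "metadata": {
    "document": {"title": "Page Title", "description": "Meta description",
                 "charset": "utf-8", "language": "en"},
    "headers": [{"level": 1, "text": "Title", "id": "title"}],
    "links": [{"text": "Example", "href": "https://example.com",
               "title": null, "rel": ["external"]}],
    "images": [{"src": "https://example.com/image.jpg", "alt": "Hero image",
                "title": null, "dimensions": [640, 480]}]
  }
}
"""


def summarize(payload: dict) -> dict:
    """Pull a few commonly needed facts out of a --with-metadata payload."""
    meta = payload.get("metadata", {})
    document = meta.get("document", {})
    return {
        "title": document.get("title"),
        "header_count": len(meta.get("headers", [])),
        # Links whose rel list contains "external".
        "external_links": [
            link["href"]
            for link in meta.get("links", [])
            if "external" in (link.get("rel") or [])
        ],
        # Map image URL -> (width, height) when dimensions are present.
        "image_sizes": {
            image["src"]: tuple(image["dimensions"])
            for image in meta.get("images", [])
            if image.get("dimensions")
        },
    }


summary = summarize(json.loads(sample))
print(summary)
```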

Python Example

from html_to_markdown import convert_with_metadata

html = '''
<html>
  <head>
    <title>Product Guide</title>
    <meta name="description" content="Complete product documentation">
  </head>
  <body>
    <h1>Getting Started</h1>
    <p>Visit our <a href="https://example.com">website</a> for more.</p>
    <img src="https://example.com/guide.jpg" alt="Setup diagram" width="800" height="600">
  </body>
</html>
'''

markdown, metadata = convert_with_metadata(
    html,
    options={'heading_style': 'Atx'},
    metadata_config={
        'extract_document': True,
        'extract_headers': True,
        'extract_links': True,
        'extract_images': True,
    }
)

print(markdown)
print(f"Title: {metadata['document']['title']}")
print(f"Links found: {len(metadata['links'])}")

TypeScript/Node.js Example

import { convertWithMetadata } from 'html-to-markdown-node';

const html = `
  <html>
    <head>
      <title>Article</title>
      <meta name="description" content="Tech article">
    </head>
    <body>
      <h1>Web Performance</h1>
      <p>Read our <a href="/blog">blog</a> for tips.</p>
      <img src="/perf.png" alt="Chart" width="1200" height="630">
    </body>
  </html>
`;

const { markdown, metadata } = await convertWithMetadata(html, {
  headingStyle: 'Atx',
}, {
  extract_document: true,
  extract_headers: true,
  extract_links: true,
  extract_images: true,
});

console.log(markdown);
console.log(`Found ${metadata.headers.length} headers`);
console.log(`Found ${metadata.links.length} links`);

Ruby Example

require 'html_to_markdown'

html = <<~HTML
  <html>
    <head>
      <title>Documentation</title>
      <meta name="description" content="API Reference">
    </head>
    <body>
      <h2>Installation</h2>
      <p>See our <a href="https://github.com">GitHub</a>.</p>
      <img src="https://example.com/diagram.svg" alt="Architecture" width="960" height="540">
    </body>
  </html>
HTML

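# Options and metadata config are positional, matching the signature
# convert_with_metadata(html, options = nil, metadata_config = nil).
markdown, metadata = HtmlToMarkdown.convert_with_metadata(
  html,
  { heading_style: :atx },
  {
    extract_document: true,
    extract_headers: true,
    extract_links: true,
    extract_images: true,
  }
)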

puts markdown
puts "Title: #{metadata[:document][:title]}"
puts "Images: #{metadata[:images].length}"

PHP Example

<?php
use HtmlToMarkdown\HtmlToMarkdown;

$html = <<<HTML
<html>
  <head>
    <title>Tutorial</title>
    <meta name="description" content="Step-by-step guide">
  </head>
  <body>
    <h1>Getting Started</h1>
    <p>Check our <a href="https://example.com/guide">guide</a>.</p>
    <img src="https://example.com/steps.png" alt="Steps" width="1024" height="768">
  </body>
</html>
HTML;

[$markdown, $metadata] = convert_with_metadata(
    $html,
    options: ['heading_style' => 'Atx'],
    metadataConfig: [
        'extract_document' => true,
        'extract_headers' => true,
        'extract_links' => true,
        'extract_images' => true,
    ]
);

echo "Title: " . $metadata['document']['title'] . "\n";
echo "Found " . count($metadata['links']) . " links\n";

Go Example

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown"
)

func main() {
	html := `
	<html>
		<head>
			<title>Developer Guide</title>
			<meta name="description" content="Complete API reference">
		</head>
		<body>
			<h1>API Overview</h1>
			<p>Learn more at our <a href="https://api.example.com/docs">API docs</a>.</p>
			<img src="https://example.com/api-flow.png" alt="API Flow" width="1280" height="720">
		</body>
	</html>
	`

	markdown, metadata, err := htmltomarkdown.ConvertWithMetadata(html, &htmltomarkdown.MetadataConfig{
		ExtractDocument:       true,
		ExtractHeaders:        true,
		ExtractLinks:          true,
		ExtractImages:         true,
		ExtractStructuredData: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Markdown:", markdown)
	fmt.Printf("Title: %s\n", metadata.Document.Title)
	fmt.Printf("Found %d links\n", len(metadata.Links))

	// Marshal to JSON if needed
	jsonBytes, _ := json.MarshalIndent(metadata, "", "  ")
	fmt.Println(string(jsonBytes))
}

Java Example

import io.github.goldziher.htmltomarkdown.HtmlToMarkdown;
import io.github.goldziher.htmltomarkdown.ConversionResult;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class MetadataExample {
    public static void main(String[] args) {
        String html = """
            <html>
              <head>
                <title>Java Guide</title>
                <meta name="description" content="Complete Java bindings documentation">
              </head>
              <body>
                <h1>Quick Start</h1>
                <p>Visit our <a href="https://github.com/Goldziher/html-to-markdown">GitHub</a>.</p>
                <img src="https://example.com/java-flow.png" alt="Flow diagram" width="1024" height="576">
              </body>
            </html>
            """;

        try {
            ConversionResult result = HtmlToMarkdown.convertWithMetadata(
                html,
                new HtmlToMarkdown.MetadataOptions()
                    .extractDocument(true)
                    .extractHeaders(true)
                    .extractLinks(true)
                    .extractImages(true)
            );

            System.out.println("Markdown:\n" + result.getMarkdown());
            System.out.println("Title: " + result.getMetadata().getDocument().getTitle());
            System.out.println("Links found: " + result.getMetadata().getLinks().size());

            // Pretty-print metadata as JSON
            Gson gson = new GsonBuilder().setPrettyPrinting().create();
            System.out.println(gson.toJson(result.getMetadata()));
        } catch (HtmlToMarkdown.ConversionException e) {
            System.err.println("Conversion failed: " + e.getMessage());
        }
    }
}

C# Example

using HtmlToMarkdown;
using System.Text.Json;

var html = @"
<html>
  <head>
    <title>C# Guide</title>
    <meta name=""description"" content=""Official C# bindings documentation"">
  </head>
  <body>
    <h1>Introduction</h1>
    <p>See our <a href=""https://github.com/Goldziher/html-to-markdown"">repository</a>.</p>
    <img src=""https://example.com/csharp-arch.png"" alt=""Architecture"" width=""1200"" height=""675"">
  </body>
</html>
";

try
{
    var result = HtmlToMarkdownConverter.ConvertWithMetadata(
        html,
        new MetadataConfig
        {
            ExtractDocument = true,
            ExtractHeaders = true,
            ExtractLinks = true,
            ExtractImages = true,
        }
    );

    Console.WriteLine("Markdown:");
    Console.WriteLine(result.Markdown);

    Console.WriteLine($"Title: {result.Metadata.Document.Title}");
    Console.WriteLine($"Links found: {result.Metadata.Links.Count}");

    // Serialize metadata to JSON
    var options = new JsonSerializerOptions { WriteIndented = true };
    var json = JsonSerializer.Serialize(result.Metadata, options);
    Console.WriteLine(json);
}
catch (HtmlToMarkdownException ex)
{
    Console.Error.WriteLine($"Conversion failed: {ex.Message}");
}

See the individual binding READMEs for detailed metadata extraction options:

Python – Python README
TypeScript/Node.js – Node.js README | TypeScript README
Ruby – Ruby README
PHP – PHP README
Go – Go README
Java – Java README
C#/.NET – C# README
WebAssembly – WASM README
Rust – Rust README

Python (v2 API)

from html_to_markdown import convert, convert_with_inline_images, InlineImageConfig

html = "<h1>Hello</h1><p>Rust ❤️ Markdown</p>"
markdown = convert(html)

markdown, inline_images, warnings = convert_with_inline_images(
    '<img src="data:image/png;base64,...==" alt="Pixel">',
    image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)
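
To make the `max_decoded_size_bytes` limit concrete, here is a standard-library sketch that builds a `data:` URI like the one above and measures its decoded payload. It does not call html-to-markdown: `make_data_uri` and `decoded_size` are illustrative helpers, and how the library itself enforces the limit is not described in this README.

```python
import base64


def make_data_uri(payload: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI suitable for an <img src>."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def decoded_size(data_uri: str) -> int:
    """Return the size in bytes of the payload carried by a base64 data URI."""
    header, _, encoded = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return len(base64.b64decode(encoded))


payload = bytes(range(256)) * 3  # 768 bytes of stand-in image data
uri = make_data_uri(payload)
html = f'<img src="{uri}" alt="Pixel">'

limit = 1024  # same value passed as max_decoded_size_bytes above
print(decoded_size(uri), decoded_size(uri) <= limit)
```

Note that the base64 text is about a third longer than the bytes it encodes, so a limit on decoded size is not the same as a limit on the length of the `src` attribute.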

Elixir

{:ok, markdown} = HtmlToMarkdown.convert("<h1>Hello</h1>")

# Keyword options are supported (internally mapped to the Rust ConversionOptions struct)
HtmlToMarkdown.convert!("<p>Wrap me</p>", wrap: true, wrap_width: 32, preprocessing: %{enabled: true})

Rust

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};

let html = "<h1>Welcome</h1><p>Fast conversion</p>";
let markdown = convert(html, None)?;

let options = ConversionOptions {
    heading_style: HeadingStyle::Atx,
    ..Default::default()
};
let markdown = convert(html, Some(options))?;

See the language-specific READMEs for complete configuration, hOCR workflows, and inline image extraction.

Performance

Benchmarked on Apple M4 with complex real-world documents (Wikipedia articles, tables, lists):

Operations per Second (higher is better)

Derived directly from tools/runtime-bench/results/latest.json (Apple M4, shared fixtures):

Fixture	Node.js (NAPI)	WASM	Python (PyO3)	Speedup (Node vs Python)
Lists (Timeline)	1,308	882	1,405	0.9×
Tables (Countries)	331	242	352	0.9×
Medium (Python)	150	121	158	1.0×
Large (Rust)	163	124	183	0.9×
Small (Intro)	208	163	223	0.9×
HOCR German PDF	2,944	1,637	2,991	1.0×
HOCR Invoice	27,326	7,775	23,500	1.2×
HOCR Tables	3,475	1,667	3,464	1.0×

Average Performance Summary

Implementation	Avg ops/sec (fixtures)	vs Python	Notes
Rust CLI/Binary	4,996	1.2× faster	Preprocessing now stays in one pass + reuses `parse_owned`, so the CLI leads six of the eight fixtures
Node.js (NAPI-RS)	4,488	1.1× faster	Buffer/handle combo keeps Node within ~10 % of the Rust core while serving JS runtimes
Ruby (magnus)	4,278	1.1× faster	Still extremely fast; ~25 k ops/sec on HOCR invoices without extra work
Python (PyO3)	4,034	baseline	Release-mode harness plus handle reuse keep it competitive, but it now trails Node/Rust
WebAssembly	1,576	0.4×	Portable option for Deno/browsers/edge using the new byte APIs
PHP (ext)	1,480	0.4×	Composer extension holds steady at 35–70 MB/s once the PIE build is installed
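
The averages in this summary can be reproduced from the per-fixture figures in the Runtime Benchmarks table of this README. The sketch below assumes a plain arithmetic mean over the eight fixtures, with the ops/sec values copied by hand rather than read from `latest.json`.

```python
from statistics import mean

# ops/sec per fixture, in table order: Lists, Tables, Medium, Large, Small,
# HOCR German PDF, HOCR Invoice, HOCR Embedded Tables.
OPS = {
    "Rust": [1700, 416, 190, 220, 258, 2760, 31345, 3080],
    "Node": [1308, 331, 150, 163, 208, 2944, 27326, 3475],
    "Ruby": [1349, 326, 157, 174, 214, 2936, 25740, 3328],
    "Python": [1405, 352, 158, 183, 223, 2991, 23500, 3464],
    "WASM": [882, 242, 121, 124, 163, 1637, 7775, 1667],
    "PHP": [533, 118, 59, 65, 83, 1007, 8781, 1194],
}

# Unweighted mean across fixtures, then the ratio against the Python baseline.
averages = {name: mean(values) for name, values in OPS.items()}
baseline = averages["Python"]

for name, avg in sorted(averages.items(), key=lambda item: -item[1]):
    print(f"{name:<7}{avg:>8,.0f}  {avg / baseline:.2f}x vs Python")
```

Because the HOCR Invoice fixture runs an order of magnitude faster than the others, it dominates an unweighted mean; per-fixture comparisons give a more balanced picture.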

Key Insights

Rust now leads throughput: the fused preprocessing + parse_owned pathway pushes the CLI to ~1.7 k ops/sec on the 129 KB lists page and ~31 k ops/sec on the HOCR invoice fixture.
Node.js trails the Rust CLI by roughly 10% on average after the buffer/handle work: ~1.3 k ops/sec on the lists fixture and 27 k ops/sec on HOCR invoices without any UTF-16 copies.
Python remains competitive but now sits below Node/Rust (~4.0 k average ops/sec); stick to the v2 API to avoid the deprecated compatibility shim.
Elixir matches the Rust core because the Rustler NIF executes the same ConversionOptions pipeline—benchmarks land between 170–1,460 ops/sec on the Wikipedia fixtures and >20 k ops/sec on micro HOCR payloads.
PHP and WASM stay in the 35–70 MB/s band, which is plenty for Composer queues or edge runtimes as long as the extension/module is built ahead of time.
Rust CLI results now mirror the bindings, since task bench:bindings runs the harness with cargo run --release by default—profile there, then push optimizations down into each FFI layer.

Runtime Benchmarks (PHP / Ruby / Python / Node / WASM)

Measured on Apple M4 using the fixture-driven runtime harness in tools/runtime-bench (task bench:bindings). Every binding consumes the exact same HTML fixtures and hOCR samples from test_documents/:

Document	Size	Ruby ops/sec	PHP ops/sec	Python ops/sec	Node ops/sec	WASM ops/sec	Elixir ops/sec	Rust ops/sec
Lists (Timeline)	129 KB	1,349	533	1,405	1,308	882	1,463	1,700
Tables (Countries)	360 KB	326	118	352	331	242	357	416
Medium (Python)	657 KB	157	59	158	150	121	171	190
Large (Rust)	567 KB	174	65	183	163	124	174	220
Small (Intro)	463 KB	214	83	223	208	163	247	258
HOCR German PDF	44 KB	2,936	1,007	2,991	2,944	1,637	3,113	2,760
HOCR Invoice	4 KB	25,740	8,781	23,500	27,326	7,775	20,424	31,345
HOCR Embedded Tables	37 KB	3,328	1,194	3,464	3,475	1,667	3,366	3,080
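
Ops/sec and MB/s are related through the fixture size. As a rough sketch, the conversion below multiplies the rounded sizes and Rust ops/sec from the table, assuming 1 MB = 1,000 KB. Because the inputs are rounded and the harness may use exact byte counts, the results only approximate the MB/s figures quoted elsewhere in this README.

```python
# (size in KB, Rust ops/sec) for a few rows of the table above.
FIXTURES = {
    "Lists (Timeline)": (129, 1700),
    "Tables (Countries)": (360, 416),
    "HOCR Invoice": (4, 31345),
}


def throughput_mb_per_s(size_kb: float, ops_per_sec: float, kb_per_mb: float = 1000.0) -> float:
    """Convert document size and operations per second into MB/s."""
    return size_kb * ops_per_sec / kb_per_mb


for name, (size_kb, ops) in FIXTURES.items():
    print(f"{name:<20}{throughput_mb_per_s(size_kb, ops):7.1f} MB/s")
```

Small documents such as the 4 KB invoice show very high ops/sec but lower MB/s, since fixed per-call overhead makes up a larger share of each conversion.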

The harness shells out to each runtime’s lightweight benchmark driver (packages/*/bin/benchmark.*, crates/*/bin/benchmark.ts), feeds fixtures defined in tools/runtime-bench/fixtures/*.toml, and writes machine-readable JSON reports (tools/runtime-bench/results/latest.json) for regression tracking. Add new languages or scenarios by extending those fixture files and drivers.

Use task bench:bindings to regenerate throughput numbers across all bindings or task bench:bindings:profile to capture CPU/memory samples while the benchmarks run. To focus on specific languages or fixtures (for example, task bench:bindings -- --language elixir), pass --language / --fixture directly to cargo run --manifest-path tools/runtime-bench/Cargo.toml -- ….

Need a call-stack view of the Rust core? Run task flamegraph:rust (or call the harness with --language rust --flamegraph path.svg) to profile a fixture and dump a ready-to-inspect flamegraph in tools/runtime-bench/results/.

Note on Python performance: The current Python bindings have optimization opportunities. The v2 API with direct convert() calls performs best; avoid the v1 compatibility layer for performance-critical applications.

Testing

Use the task runner to execute the entire matrix locally:

# All core test suites (Rust, Python, Ruby, Node, PHP, Go, C#, Elixir, Java)
task test

# Run the Wasmtime-backed WASM integration tests
task wasm:test:wasmtime

The Wasmtime suite builds the html-to-markdown-wasm artifact with the same flags used in CI and drives it through Wasmtime to ensure the non-JS runtime behaves exactly like the browser/Deno builds.

Compatibility (v1 → v2)

V2’s Rust core sustains 150–210 MB/s throughput, 60–80× faster than V1’s Python/BeautifulSoup implementation, which averaged ≈ 2.5 MB/s.
The Python package offers a compatibility shim in html_to_markdown.v1_compat (convert_to_markdown, convert_to_markdown_stream, markdownify). The shim is deprecated, emits DeprecationWarning on every call, and will be removed in v3.0, so plan migrations now. Details and keyword mappings live in Python README.
CLI flag changes, option renames, and other breaking updates are summarised in CHANGELOG.
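
Since the shim signals its status through `DeprecationWarning`, remaining v1 call sites can be found with Python's standard `warnings` module. The sketch below is self-contained: `convert_to_markdown` here is a stand-in that only imitates a deprecated function (it is not the real shim), and `find_deprecated_calls` is a hypothetical helper.

```python
import warnings


def convert_to_markdown(html: str) -> str:
    """Stand-in for a deprecated shim function; NOT the real implementation."""
    warnings.warn(
        "convert_to_markdown is deprecated; use convert() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return html


def find_deprecated_calls(func, *args, **kwargs):
    """Run func and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        func(*args, **kwargs)
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


print(find_deprecated_calls(convert_to_markdown, "<p>hi</p>"))
```

In a test suite, running the interpreter with `-W error::DeprecationWarning` turns such warnings into exceptions, which makes leftover calls fail loudly.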

Community

Chat with us on Discord
Explore the broader Kreuzberg document-processing ecosystem
Sponsor development via GitHub Sponsors

Ruby

require 'html_to_markdown'

html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>'
markdown = HtmlToMarkdown.convert(html, heading_style: :atx, wrap: true)

puts markdown
# # Hello
#
# Rust ❤️ Markdown

