Defuddle: Extract Clean Content from Any Web Page
Defuddle (https://github.com/kepano/defuddle) extracts the main content from web pages, removing clutter such as comments, sidebars, headers, footers, and other non-essential elements.
With 5.6k stars and 224 forks on GitHub, Defuddle is a popular open-source tool for content extraction. It was originally created for the Obsidian Web Clipper but is designed to run in any environment.
How It Works
Defuddle takes a URL or HTML input, identifies the main content using heuristics, and returns cleaned HTML or Markdown. It was built as a replacement for Mozilla Readability with some key differences:
- More forgiving, removes fewer uncertain elements
- Provides consistent output for footnotes, math, and code blocks
- Uses mobile styles to guess unnecessary elements
- Extracts more metadata, including schema.org data
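To make "identifies the main content using heuristics" more concrete, here is a toy sketch of one classic heuristic: prefer blocks with a lot of text and few links. This is not Defuddle's actual algorithm. The `Block` shape, `scoreBlock`, and `pickMainBlock` are hypothetical names invented for illustration, and the scoring weights are arbitrary.

```typescript
// Toy illustration only: NOT Defuddle's real implementation.
// A simplified stand-in for a DOM element (hypothetical shape).
interface Block {
  tag: string;      // element name, e.g. "article", "nav", "div"
  text: string;     // all visible text inside the block
  linkText: string; // the part of that text that sits inside links
}

// Long text with few links scores high; link-heavy blocks (menus, "related" lists) score low.
function scoreBlock(block: Block): number {
  const textLength = block.text.trim().length;
  if (textLength === 0) return 0;
  // Fraction of the text that is link text, clamped to 1.
  const linkDensity = Math.min(1, block.linkText.trim().length / textLength);
  // Arbitrary bonus for semantic containers (an assumption for this sketch).
  const tagBonus = block.tag === "article" || block.tag === "main" ? 1.5 : 1;
  return textLength * (1 - linkDensity) * tagBonus;
}

// Return the highest-scoring block, or undefined if nothing scores above zero.
function pickMainBlock(blocks: Block[]): Block | undefined {
  let best: Block | undefined;
  let bestScore = 0;
  for (const block of blocks) {
    const score = scoreBlock(block);
    if (score > bestScore) {
      best = block;
      bestScore = score;
    }
  }
  return best;
}

const sampleBlocks: Block[] = [
  { tag: "nav", text: "Home About Contact", linkText: "Home About Contact" },
  { tag: "article", text: "A long paragraph of article text with one link.", linkText: "link" },
  { tag: "div", text: "Related: other story", linkText: "other story" },
];
console.log(pickMainBlock(sampleBlocks)?.tag); // "article"
```

A real extractor combines many such signals. The list above suggests Defuddle also looks at mobile styles and schema.org metadata, which this sketch does not attempt to model.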
Usage
Browser:

    import Defuddle from "defuddle";

    // Parse the DOM of the current page
    const defuddle = new Defuddle(document);
    const result = defuddle.parse();

    console.log(result.content);
    console.log(result.title);
    console.log(result.author);

Node.js:

    import { JSDOM } from "jsdom";
    import { Defuddle } from "defuddle/node";

    // `html` is the page's HTML source as a string, obtained elsewhere
    const dom = new JSDOM(html, { url: "https://example.com/article" });

    const result = await Defuddle(
      dom.window.document,
      "https://example.com/article",
      {
        markdown: true, // return Markdown instead of HTML
      },
    );

CLI:

    npx defuddle parse https://example.com/article --markdown

Response Properties
Defuddle returns an object with:
- content: Cleaned extracted content
- title: Article title
- author: Author name
- description: Description or summary
- domain: Domain name
- image: Main article image
- published: Publication date
- wordCount: Total word count
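As a rough illustration of working with the properties listed above, the sketch below types the result and renders it as a note with YAML-style front matter. The field names come from the list; the types (strings, with `wordCount` as a number) are assumptions, and `ExtractedArticle` and `toNote` are hypothetical names, not part of Defuddle's API.

```typescript
// Field names follow the article's list; the types are assumptions.
interface ExtractedArticle {
  content: string;
  title: string;
  author: string;
  description: string;
  domain: string;
  image: string;
  published: string;
  wordCount: number;
}

// Hypothetical helper: put non-empty metadata in front matter above the content.
function toNote(article: ExtractedArticle): string {
  const meta: Array<[string, string | number]> = [
    ["title", article.title],
    ["author", article.author],
    ["description", article.description],
    ["domain", article.domain],
    ["image", article.image],
    ["published", article.published],
    ["wordCount", article.wordCount],
  ];
  const lines = meta
    .filter(([, value]) => value !== "") // skip fields the extractor left empty
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`); // quote strings safely
  return ["---", ...lines, "---", "", article.content].join("\n");
}

const note = toNote({
  content: "Hello",
  title: "Example",
  author: "",
  description: "",
  domain: "example.com",
  image: "",
  published: "",
  wordCount: 1,
});
console.log(note);
```

In practice the object passed to `toNote` would be the result of `defuddle.parse()` or the Node.js call, assuming its fields match the shape sketched here.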
Bundles
- Core (defuddle): Main bundle for the browser, no dependencies
- Full (defuddle/full): Includes math equation parsing and Markdown conversion
- Node.js (defuddle/node): For Node environments, accepts any DOM implementation
Install

    npm install defuddle

🔗 GitHub: github.com/kepano/defuddle