Defuddle: Extract Clean Content from Any Web Page

Defuddle (https://github.com/kepano/defuddle) extracts the main content from web pages, removing clutter like comments, sidebars, headers, footers, and other non-essential elements.

With 5.6k stars and 224 forks on GitHub, Defuddle is a popular open-source tool for content extraction. It was originally created for the Obsidian Web Clipper but is designed to run in any environment.

How It Works

Defuddle takes a URL or HTML input, identifies the main content using heuristics, and returns cleaned HTML or Markdown. It was built as a replacement for Mozilla Readability with some key differences:

More forgiving: removes fewer uncertain elements
Provides consistent output for footnotes, math, and code blocks
Uses a page's mobile styles to guess which elements are unnecessary
Extracts more metadata, including schema.org data

Usage

Browser:

import Defuddle from "defuddle";

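// Pass the live document of the current page.
const defuddle = new Defuddle(document);
// parse() returns the extracted content along with page metadata.
const result = defuddle.parse();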

console.log(result.content);
console.log(result.title);
console.log(result.author);

Node.js:

import { JSDOM } from "jsdom";
import { Defuddle } from "defuddle/node";

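// Placeholder markup; in practice this is the fetched page source.
const html = "<html><body><article><h1>Title</h1><p>Text</p></article></body></html>";

// Giving jsdom the page URL lets it resolve relative links.
const dom = new JSDOM(html, { url: "https://example.com/article" });
const result = await Defuddle(
  dom.window.document,
  "https://example.com/article",
  {
    // Return the content as Markdown instead of HTML.
    markdown: true,
  },
);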

CLI:

npx defuddle parse https://example.com/article --markdown

Response Properties

Defuddle returns an object with:

content — Cleaned extracted content
title — Article title
author — Author name
description — Description or summary
domain — Domain name
image — Main article image
published — Publication date
wordCount — Total word count
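
As an illustration of how these properties might be used together, here is a minimal sketch that turns a result object into a Markdown note with YAML front matter, roughly the kind of output a web clipper produces. The toMarkdownNote helper and the sample values are hypothetical and not part of Defuddle. The sketch assumes the result was produced with the markdown option, so that content already holds Markdown.

```javascript
// Hypothetical helper, not part of Defuddle. It only reads the properties
// listed above from a plain object shaped like a Defuddle result.
function toMarkdownNote(result) {
  // Metadata fields to emit as front matter, in a fixed order.
  const keys = ["title", "author", "description", "domain", "image", "published", "wordCount"];

  const lines = ["---"];
  for (const key of keys) {
    const value = result[key];
    // Skip fields that are missing or empty.
    if (value === undefined || value === null || value === "") continue;
    // JSON.stringify quotes strings and leaves numbers bare,
    // which keeps each value a simple YAML scalar.
    lines.push(`${key}: ${JSON.stringify(value)}`);
  }
  // Close the front matter, add a blank line, then the extracted content.
  lines.push("---", "", result.content ?? "");
  return lines.join("\n");
}

// Made-up sample standing in for a real result (assumption for illustration).
const sample = {
  content: "# Hello\n\nBody text.",
  title: "Hello",
  author: "",
  domain: "example.com",
  wordCount: 3,
};

console.log(toMarkdownNote(sample));
```

Empty fields such as the blank author above are left out of the front matter, so the note only carries metadata that was actually found.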

Bundles

Core (defuddle) — Main bundle for browser, no dependencies
Full (defuddle/full) — Includes math equation parsing and Markdown conversion
Node.js (defuddle/node) — For Node environments, accepts any DOM implementation
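
The choice between bundles can be summed up as a small decision function. This is only a sketch of the descriptions above: pickBundle and its option names are hypothetical, and the returned strings are just the import paths listed in this section.

```javascript
// Hypothetical helper, not part of Defuddle: maps requirements to one of
// the import paths named in the bundle list.
function pickBundle({ isNode = false, needsMath = false, needsMarkdown = false } = {}) {
  // Node environments use the dedicated Node.js bundle.
  if (isNode) return "defuddle/node";
  // In the browser, math parsing and Markdown conversion need the full bundle.
  if (needsMath || needsMarkdown) return "defuddle/full";
  // Otherwise the dependency-free core bundle is enough.
  return "defuddle";
}

console.log(pickBundle({ needsMarkdown: true }));
```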

Install

npm install defuddle

GitHub: https://github.com/kepano/defuddle