Show HN: CLI for crawling documentation sites into Markdown with defuddle

artemnistuley/docrawl: Lightweight CLI for crawling documentation sites into Markdown with defuddle

docrawl

docrawl is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with defuddle.

It is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.

Why

docrawl is useful when you want to:

turn docs sites into Markdown for LLM context

build local knowledge bases

feed content into RAG pipelines

archive clean docs content without a browser dependency

Requirements

Node.js >= 20

Install

Run without installing:

npx docrawl --help

Install globally:

npm install -g docrawl

Then run:

docrawl --help

Local setup

npm install

Development

Build:

npm run build

Run the CLI from the project workspace:

npm run start -- --help

Run tests:

npm test

CLI

crawl

docrawl crawl <url> [options]

Examples:

# Crawl a docs section into ./output
docrawl crawl https://docs.example.com/guide/

# Run a smaller smoke test first
docrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose

# Merge everything into one file
docrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md

# Crawl the full hostname, not only the seed path subtree
docrawl crawl https://docs.example.com --domain --max-pages 200

Options:

-o, --output         Output directory or file path
-s, --single-file    Merge all pages into one Markdown file
--domain             Crawl the whole hostname, not just the seed path
--depth              Maximum crawl depth
--max-pages          Maximum pages to process (default: 500)
--concurrency        Concurrent requests (default: 3)
--delay              Delay between requests per worker (default: 500)
--lang               Preferred language, BCP 47
--no-sitemap         Disable sitemap discovery
--include            Include URL glob pattern, repeatable
--exclude            Exclude URL glob pattern, repeatable
--verbose            Verbose progress logging
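The difference between the default seed-path scope and --domain can be illustrated with a small sketch. This is not taken from docrawl's source: the function name isInScope and the exact matching rule (same hostname, then a directory-style path prefix) are assumptions made for illustration, and the real crawler may decide scope differently.

```typescript
// Hypothetical helper, not docrawl's actual implementation.
// Decides whether a discovered link should be crawled, given the seed URL
// and whether whole-hostname crawling (as with --domain) is enabled.
function isInScope(seed: string, candidate: string, wholeDomain: boolean): boolean {
  const seedUrl = new URL(seed);
  // Resolve relative links against the seed URL.
  const candidateUrl = new URL(candidate, seedUrl);

  // Never leave the seed hostname.
  if (candidateUrl.hostname !== seedUrl.hostname) return false;

  // Whole-hostname mode: any path on the same host is in scope.
  if (wholeDomain) return true;

  // Default mode (assumed): stay inside the seed path subtree.
  // The trailing slash keeps "/guide" from matching "/guidelines".
  const prefix = seedUrl.pathname.endsWith("/")
    ? seedUrl.pathname
    : seedUrl.pathname + "/";
  return (
    candidateUrl.pathname === seedUrl.pathname ||
    candidateUrl.pathname.startsWith(prefix)
  );
}

const seed = "https://docs.example.com/guide/";
console.log(isInScope(seed, "https://docs.example.com/guide/intro", false));
console.log(isInScope(seed, "https://docs.example.com/api/reference", false));
console.log(isInScope(seed, "https://docs.example.com/api/reference", true));
```

Under these assumptions, the second call is out of scope until the whole-hostname flag is set, which mirrors the last crawl example above.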

parse

docrawl parse <url> [options]

Examples:

# Parse one page as Markdown
docrawl parse https://docs.example.com/guide/intro

# Parse one page as JSON
docrawl parse https://docs.example.com/guide/intro --json

Options:

-j, --json    Output full JSON response
--lang        Preferred language, BCP 47

Output

Separate files

By default, docrawl crawl writes one Markdown file per successful page and a manifest.json.

Example layout:

output/
├── getting-started/
│   ├── introduction.md
│   └── quickstart.md
└── manifest.json

Each Markdown file includes frontmatter with fields such as:

title

sourceUrl

finalUrl

canonicalUrl

crawledAt

depth

wordCount

contentHash
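A downstream script can read these fields back out of a generated file. The sketch below assumes the frontmatter is a block delimited by --- lines containing flat key: value pairs, which is the common convention but is not confirmed here. The function name parseFrontmatter and the sample document are hypothetical.

```typescript
// Hypothetical reader for a crawled Markdown file.
// Assumes flat "key: value" frontmatter between two "---" lines.
function parseFrontmatter(markdown: string): {
  fields: Record<string, string>;
  body: string;
} {
  const lines = markdown.split(/\r?\n/);
  const fields: Record<string, string> = {};

  // No opening delimiter: treat the whole file as body.
  if (lines[0] !== "---") return { fields, body: markdown };

  // Find the closing delimiter after the first line.
  const end = lines.indexOf("---", 1);
  if (end === -1) return { fields, body: markdown };

  for (const line of lines.slice(1, end)) {
    // Split on the first colon only, so URL values stay intact.
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    const value = line
      .slice(sep + 1)
      .trim()
      .replace(/^["']|["']$/g, ""); // drop surrounding quotes, if any
    fields[key] = value;
  }

  return { fields, body: lines.slice(end + 1).join("\n") };
}

// Made-up sample using field names from the list above.
const sample = [
  "---",
  "title: Introduction",
  "sourceUrl: https://docs.example.com/guide/intro",
  "depth: 1",
  "---",
  "# Introduction",
  "",
].join("\n");

const parsed = parseFrontmatter(sample);
console.log(parsed.fields.title, parsed.fields.sourceUrl);
```

All values come back as strings, so numeric fields such as depth or wordCount would need converting by the caller. For nested or multi-line values, a full YAML parser would be the safer choice.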

Single file

With --single-file, docrawl writes:

one merged Markdown file

one adjacent manifest file named after the output file, such as context.manifest.json for context.md

The merged file includes a table of contents and one section per successful page.
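To make the shape of such a merged file concrete, here is a rough sketch of combining pages under a table of contents. The Page interface, the heading levels, the anchor slugs, and the "Source:" line are all assumptions for illustration. docrawl's actual merged layout may differ.

```typescript
// Hypothetical page shape. docrawl's internal types are not shown here.
interface Page {
  title: string;
  sourceUrl: string;
  markdown: string;
}

// Build a simple anchor slug from a title (assumed convention).
function slugify(title: string): string {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Merge pages into one Markdown document: a table of contents first,
// then one section per page, in the order given.
function mergePages(pages: Page[]): string {
  const toc = pages.map((p) => `- [${p.title}](#${slugify(p.title)})`);
  const sections = pages.map(
    (p) => `## ${p.title}\n\nSource: ${p.sourceUrl}\n\n${p.markdown.trim()}`
  );
  return (
    ["# Table of contents", "", toc.join("\n"), "", sections.join("\n\n")].join("\n") +
    "\n"
  );
}

const merged = mergePages([
  {
    title: "Introduction",
    sourceUrl: "https://docs.example.com/guide/intro",
    markdown: "Welcome to the guide.",
  },
  {
    title: "Quickstart",
    sourceUrl: "https://docs.example.com/guide/quickstart",
    markdown: "Install and run.",
  },
]);
console.log(merged);
```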

Example:

docrawl crawl https://docs.example.com --single-file --output ./context.md

Produces:

context.md
context.manifest.json
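A script that consumes the merged file may also need to locate its manifest. The sketch below infers the naming rule from the single example above (context.md becomes context.manifest.json): drop the last extension and append .manifest.json. The function name manifestPathFor is hypothetical, and the handling of names without an extension is a guess.

```typescript
// Hypothetical helper. The rule is inferred from one example in this README.
function manifestPathFor(outputFile: string): string {
  // Position of the last path separator (POSIX or Windows style).
  const slash = Math.max(
    outputFile.lastIndexOf("/"),
    outputFile.lastIndexOf("\\")
  );
  const dot = outputFile.lastIndexOf(".");

  // Strip the extension only if the dot belongs to the file name and is
  // not its first character (so dotfiles and "./" prefixes are left alone).
  const stem = dot > slash + 1 ? outputFile.slice(0, dot) : outputFile;
  return `${stem}.manifest.json`;
}

console.log(manifestPathFor("./context.md"));
```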

Current...
