Show HN: CLI for crawling documentation sites into Markdown with defuddle

nistuley | 1 pts | 0 comments

GitHub - artemnistuley/docrawl: Lightweight CLI for crawling documentation sites into Markdown with defuddle · GitHub


docrawl

docrawl is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with defuddle.

It is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.

Why

docrawl is useful when you want to:

turn docs sites into Markdown for LLM context

build local knowledge bases

feed content into RAG pipelines

archive clean docs content without a browser dependency

Requirements

Node.js >= 20

Install

Run without installing:

npx docrawl --help

Install globally:

npm install -g docrawl

Then run:

docrawl --help

Local setup

npm install

Development

Build:

npm run build

Run the CLI from the project workspace:

npm run start -- --help

Run tests:

npm test

CLI

crawl

docrawl crawl <url> [options]

Examples:

# Crawl a docs section into ./output
docrawl crawl https://docs.example.com/guide/

# Run a smaller smoke test first
docrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose

# Merge everything into one file
docrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md

# Crawl the full hostname, not only the seed path subtree
docrawl crawl https://docs.example.com --domain --max-pages 200

Options:

-o, --output       Output directory or file path
-s, --single-file  Merge all pages into one Markdown file
--domain           Crawl the whole hostname, not just the seed path
--depth            Maximum crawl depth
--max-pages        Maximum pages to process (default: 500)
--concurrency      Concurrent requests (default: 3)
--delay            Delay between requests per worker (default: 500)
--lang             Preferred language, BCP 47
--no-sitemap       Disable sitemap discovery
--include          Include URL glob pattern, repeatable
--exclude          Exclude URL glob pattern, repeatable
--verbose          Verbose progress logging
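The scoping options (seed path by default, whole hostname with --domain, plus --include and --exclude globs) can be pictured with a small Python sketch. This is not docrawl's implementation: the function name `in_scope`, the prefix-match rule, and the "exclude wins over include" ordering are all assumptions made for illustration, and the real CLI may normalize URLs or match globs differently.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def in_scope(url, seed, whole_domain=False, include=(), exclude=()):
    """Return True if `url` would be crawled under the ASSUMED rules below."""
    u, s = urlsplit(url), urlsplit(seed)
    # Assumption: a crawl never leaves the seed's hostname.
    if u.hostname != s.hostname:
        return False
    # Assumption: without --domain, only the seed path subtree is followed.
    if not whole_domain:
        prefix = s.path if s.path.endswith("/") else s.path + "/"
        if not (u.path + "/").startswith(prefix):
            return False
    # Assumption: an exclude match always wins.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    # Assumption: if include patterns are given, at least one must match.
    if include and not any(fnmatchcase(url, pattern) for pattern in include):
        return False
    return True
```

Under these assumptions, `in_scope("https://docs.example.com/api/ref", "https://docs.example.com/guide/")` is false, and becomes true once `whole_domain=True` is passed, which mirrors the difference between the default crawl and a --domain crawl.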

parse

docrawl parse <url> [options]

Examples:

# Parse one page as Markdown
docrawl parse https://docs.example.com/guide/intro

# Parse one page as JSON
docrawl parse https://docs.example.com/guide/intro --json

Options:

-j, --json  Output full JSON response
--lang      Preferred language, BCP 47

Output

Separate files

By default, docrawl crawl writes one Markdown file per successful page and a manifest.json.

Example layout:

output/
├── getting-started/
│   ├── introduction.md
│   └── quickstart.md
└── manifest.json
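For downstream use (a local knowledge base or a RAG ingest step), the per-page layout can be walked with a few lines of Python. This is a minimal sketch that only assumes what the layout above shows: Markdown pages nested under the output directory, with manifest.json alongside them. The helper name `list_pages` is hypothetical.

```python
from pathlib import Path


def list_pages(output_dir):
    """Return the relative paths of all Markdown pages under `output_dir`, sorted."""
    root = Path(output_dir)
    # rglob descends into subdirectories such as getting-started/.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.md"))
```

Calling `list_pages("output")` on the example layout would return the two pages under getting-started/ and skip manifest.json, since only `.md` files are matched.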

Each Markdown file includes frontmatter with fields such as:

title

sourceUrl

finalUrl

canonicalUrl

crawledAt

depth

wordCount

contentHash
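A consumer can read those fields back before indexing a page. The sketch below assumes the frontmatter is a block of simple `key: value` lines between two `---` delimiters, which is the common Markdown convention but is not confirmed here; the sample values are made up, and nested or multi-line YAML would need a real YAML parser.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body).

    Assumes flat `key: value` frontmatter delimited by `---` lines.
    All values are returned as strings.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    return {}, text  # no closing delimiter: treat the whole file as body


# Illustrative input; the values are invented for this example.
sample = """---
title: Introduction
sourceUrl: https://docs.example.com/guide/intro
depth: 1
---
# Introduction
"""
meta, body = split_frontmatter(sample)
print(meta["title"], meta["sourceUrl"])
```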

Single file

With --single-file, docrawl writes:

one merged Markdown file

one adjacent manifest file named like <name>.manifest.json

The merged file includes a table of contents and one section per successful page.

Example:

docrawl crawl https://docs.example.com --single-file --output ./context.md

Produces:

context.md
context.manifest.json
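A script that consumes the merged file can locate the manifest from the output path alone. The sketch below infers the naming rule from the single example above (context.md gives context.manifest.json); the helper name `manifest_path` is hypothetical, and how docrawl treats other extensions is not stated.

```python
from pathlib import Path


def manifest_path(output):
    """Derive the adjacent manifest path, e.g. context.md -> context.manifest.json."""
    out = Path(output)
    # Replace the final extension with ".manifest.json", keeping the directory.
    return out.with_name(out.stem + ".manifest.json")


print(manifest_path("./context.md"))
```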

Current...
