GitHub - artemnistuley/docrawl: Lightweight CLI for crawling documentation sites into Markdown with defuddle · GitHub
docrawl
docrawl is a lightweight Node.js CLI for crawling documentation sites and converting them into Markdown with defuddle.
It is built for static and server-rendered docs sites such as Docusaurus, VitePress, MkDocs, GitBook exports, and Obsidian Publish. It does not run a browser and does not execute page JavaScript.
Why
docrawl is useful when you want to:
turn docs sites into Markdown for LLM context
build local knowledge bases
feed content into RAG pipelines
archive clean docs content without a browser dependency
Requirements
Node.js >= 20
Install
Run without installing:
npx docrawl --help
Install globally:
npm install -g docrawl
Then run:
docrawl --help
Local setup
npm install
Development
Build:
npm run build
Run the CLI from the project workspace:
npm run start -- --help
Run tests:
npm test
CLI
crawl
docrawl crawl <url> [options]
Examples:
# Crawl a docs section into ./output
docrawl crawl https://docs.example.com/guide/

# Run a smaller smoke test first
docrawl crawl https://docs.example.com/guide/ --max-pages 10 --depth 1 --verbose

# Merge everything into one file
docrawl crawl https://docs.example.com/guide/ --single-file --output ./context.md

# Crawl the full hostname, not only the seed path subtree
docrawl crawl https://docs.example.com --domain --max-pages 200
Options:
-o, --output        Output directory or file path
-s, --single-file   Merge all pages into one Markdown file
--domain            Crawl the whole hostname, not just the seed path
--depth             Maximum crawl depth
--max-pages         Maximum pages to process (default: 500)
--concurrency       Concurrent requests (default: 3)
--delay             Delay between requests per worker (default: 500)
--lang              Preferred language, BCP 47
--no-sitemap        Disable sitemap discovery
--include           Include URL glob pattern, repeatable
--exclude           Exclude URL glob pattern, repeatable
--verbose           Verbose progress logging
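The difference between the default seed-path scope and --domain can be illustrated with a small sketch. This is not docrawl's actual implementation: the helper name inScope and its prefix-matching rule are assumptions, written only to show how a crawler might decide whether a discovered link belongs to the crawl.

```typescript
// Hypothetical helper: decide whether a discovered link is inside the crawl scope.
// It illustrates the seed-path vs --domain distinction; it is not docrawl's code.
function inScope(seedUrl: string, candidateUrl: string, wholeDomain: boolean): boolean {
  const seed = new URL(seedUrl);
  const candidate = new URL(candidateUrl);

  // Both modes stay on the seed's hostname.
  if (candidate.hostname !== seed.hostname) return false;
  if (wholeDomain) return true;

  // Default mode (assumed rule): the candidate path must sit under the seed path.
  const prefix = seed.pathname.endsWith("/") ? seed.pathname : seed.pathname + "/";
  return candidate.pathname === seed.pathname || candidate.pathname.startsWith(prefix);
}

const seed = "https://docs.example.com/guide/";
console.log(inScope(seed, "https://docs.example.com/guide/intro", false)); // true
console.log(inScope(seed, "https://docs.example.com/api/reference", false)); // false
console.log(inScope(seed, "https://docs.example.com/api/reference", true)); // true
```

Under this assumed rule, a seed of https://docs.example.com/guide/ keeps the crawl inside /guide/ unless --domain widens it to the whole hostname.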
parse
docrawl parse <url> [options]
Examples:
# Parse one page as Markdown
docrawl parse https://docs.example.com/guide/intro

# Parse one page as JSON
docrawl parse https://docs.example.com/guide/intro --json
Options:
-j, --json   Output full JSON response
--lang       Preferred language, BCP 47
Output
Separate files
By default, docrawl crawl writes one Markdown file per successful page and a manifest.json.
Example layout:
output/
├── getting-started/
│   ├── introduction.md
│   └── quickstart.md
└── manifest.json
Each Markdown file includes frontmatter with fields such as:
title
sourceUrl
finalUrl
canonicalUrl
crawledAt
depth
wordCount
contentHash
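For downstream tooling, the frontmatter fields listed above can be read back from a crawled file. The sketch below assumes the frontmatter is a leading block delimited by --- lines with flat key: value pairs, which is a common convention but is not confirmed by this README. The function name readFrontmatter and the sample page are hypothetical.

```typescript
// Assumption: frontmatter is a leading block delimited by "---" lines
// containing flat "key: value" pairs. Nested YAML is not handled here.
function readFrontmatter(markdown: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const lines = markdown.split(/\r?\n/);
  if (lines[0] !== "---") return fields; // no frontmatter block present

  for (const line of lines.slice(1)) {
    if (line === "---") break; // end of the frontmatter block
    const sep = line.indexOf(":");
    if (sep === -1) continue; // skip lines that are not key/value pairs
    const key = line.slice(0, sep).trim();
    // Trim whitespace and strip optional surrounding double quotes.
    fields[key] = line.slice(sep + 1).trim().replace(/^"(.*)"$/, "$1");
  }
  return fields;
}

// Hypothetical page content; the values are made up for illustration.
const page = [
  "---",
  'title: "Introduction"',
  "sourceUrl: https://docs.example.com/guide/intro",
  "depth: 1",
  "---",
  "# Introduction",
].join("\n");

const meta = readFrontmatter(page);
console.log(meta.title, meta.sourceUrl, meta.depth);
```

All values come back as strings; a real consumer would likely use a YAML parser and convert fields such as depth and wordCount to numbers.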
Single file
With --single-file, docrawl writes:
one merged Markdown file
one adjacent manifest file named after the output file, for example context.manifest.json for context.md
The merged file includes a table of contents and one section per successful page.
Example:
docrawl crawl https://docs.example.com --single-file --output ./context.md
Produces:
context.md
context.manifest.json
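The naming shown above (context.md next to context.manifest.json) can be reproduced in a script that needs to locate the manifest for a given output file. This is a minimal sketch based only on that example; manifestPathFor is a hypothetical helper, and docrawl may derive the name differently in edge cases.

```typescript
import * as path from "node:path";

// Hypothetical helper: derive the adjacent manifest path from the merged
// output file, following the context.md -> context.manifest.json example.
function manifestPathFor(outputFile: string): string {
  const { dir, name } = path.parse(outputFile);
  // Keep the same directory, swap the extension for ".manifest.json".
  return path.join(dir, `${name}.manifest.json`);
}

console.log(manifestPathFor("context.md")); // context.manifest.json
```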
Current...