Show HN: Crawlora-deadweb – tell if a domain is dead or just blocking your bot

tonywangcn | 1 pts | 0 comments

Crawlora-org/crawlora-deadweb: Is a domain genuinely dead, or just blocking your bot? A passive, local, MIT Go CLI + library that classifies domain reachability (alive/redirect/blocked/dead). The open methodology behind the Crawlora Dead-Web Index.


crawlora-deadweb

Is a domain genuinely dead — or just blocking your bot? Tell them apart from one passive probe.

crawlora-deadweb is a small, dependency-free CLI (and Go library) that probes a domain and classifies it as alive / redirect / blocked / dead, with the reason. It distinguishes a domain that's gone (no DNS, nothing listening) from one that's alive but refusing automated clients (403 / 429 / anti-bot). Most "dead link" checkers conflate the two, and that's exactly the error behind the myth that ~27% of the web is dead. It isn't.

It is a classifier, not an unblocker. It does a DNS lookup, a TCP connect, and one honest GET /, reads the response, and labels it. It never logs in, submits a form, solves a challenge, or tries to defeat anything.
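The three steps above can be sketched with the Go standard library alone. This is a minimal illustration, not the repository's actual code: the `observation` type, the `probe` and `probeDomain` names, the HTTPS-only port 443 choice, and the 10-second timeout are all assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

// observation records how far one passive probe got (hypothetical type).
type observation struct {
	Stage  string // "dns", "tcp", "http" (where it failed) or "done"
	Status int    // HTTP status of the single response, 0 if none
	Err    error
}

// probe runs DNS lookup -> TCP connect -> one GET /, stopping at the first failure.
func probe(ctx context.Context, domain string) observation {
	// Step 1: DNS lookup.
	if _, err := net.DefaultResolver.LookupHost(ctx, domain); err != nil {
		return observation{Stage: "dns", Err: err}
	}

	// Step 2: TCP connect (assumption: port 443 only).
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", net.JoinHostPort(domain, "443"))
	if err != nil {
		return observation{Stage: "tcp", Err: err}
	}
	conn.Close()

	// Step 3: one GET /. Redirects are not followed, so the status we
	// record is the one the server actually sent for this request.
	client := &http.Client{
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://"+domain+"/", nil)
	if err != nil {
		return observation{Stage: "http", Err: err}
	}
	resp, err := client.Do(req)
	if err != nil {
		return observation{Stage: "http", Err: err}
	}
	defer resp.Body.Close()
	return observation{Stage: "done", Status: resp.StatusCode}
}

// probeDomain wraps probe with a fixed timeout (assumed value).
func probeDomain(domain string) observation {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return probe(ctx, domain)
}

func main() {
	// Probe every domain given on the command line.
	for _, domain := range os.Args[1:] {
		obs := probeDomain(domain)
		fmt.Printf("%s: stage=%s status=%d err=%v\n", domain, obs.Stage, obs.Status, obs.Err)
	}
}
```

Recording the stage at which the probe stopped keeps the "gone" cases (DNS or TCP failure) separate from the "answered but unhelpful" cases, which is the distinction the labels rely on.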

Classification runs locally and in the open, from the public response. For the measured browser-fingerprint arm (re-probing a blocked domain with a real Chrome TLS/JA3 client across the proxied fleet to see which "blocked" sites are actually reachable), add --browser, which calls Crawlora's hosted engine.

This powers, and is the open companion to, the Dead-Web Index: a reachability census of the top 10 million domains that found ~14% genuinely dead, not the usual 27.6% (most "dead" is anti-bot blocking or a served error).

What the labels mean

alive: a usable HTTP response (2xx, or a 4xx/5xx the server answered; a response isn't death).

redirect: ended on an unresolved redirect.

blocked: the host is up but won't serve us: anti-bot / auth / rate-limit, or it accepts a TCP connection but won't complete HTTP (tarpit / strict TLS).

dead: no DNS resolution, a refused/reset connection, or nothing listening. Genuinely gone.
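The label definitions above can be read as a small decision function. The sketch below is one possible reading, not the tool's real classifier: the `observation` fields are hypothetical, treating 401 as the "auth" case is an assumption, and every reason string except `dns_failed` (which appears in the tool's sample output) is invented for illustration.

```go
package main

import "fmt"

// observation is a hypothetical summary of what one probe saw.
type observation struct {
	DNSFailed     bool // the name did not resolve
	ConnectFailed bool // TCP refused/reset, or nothing listening
	HTTPCompleted bool // false if TCP connected but HTTP never completed
	FinalStatus   int  // status of the last response received
}

// classify maps an observation to (outcome, reason), checking the
// "gone" conditions first and the "answered" conditions last.
func classify(o observation) (outcome, reason string) {
	switch {
	case o.DNSFailed:
		return "dead", "dns_failed"
	case o.ConnectFailed:
		return "dead", "connect_failed" // hypothetical reason string
	case !o.HTTPCompleted:
		// Host is up but will not finish HTTP (tarpit / strict TLS).
		return "blocked", "http_incomplete" // hypothetical reason string
	case o.FinalStatus == 401 || o.FinalStatus == 403 || o.FinalStatus == 429:
		// Assumption: auth, forbidden, and rate-limit statuses count as blocked.
		return "blocked", "refused_client" // hypothetical reason string
	case o.FinalStatus >= 300 && o.FinalStatus < 400:
		return "redirect", "unresolved_redirect" // hypothetical reason string
	default:
		// Any other answered response, including 404 or 500, is still alive.
		return "alive", "responded" // hypothetical reason string
	}
}

func main() {
	samples := []observation{
		{DNSFailed: true},
		{HTTPCompleted: true, FinalStatus: 403},
		{HTTPCompleted: true, FinalStatus: 404},
	}
	for _, s := range samples {
		outcome, reason := classify(s)
		fmt.Println(outcome, reason)
	}
}
```

The order of the cases matters: a server that answers with 404 or 500 falls through to alive, because an answered error is still evidence that the host exists.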

Install

    # from source (Go 1.23+)
    go install github.com/Crawlora-org/crawlora-deadweb@latest

    # or clone + build
    git clone https://github.com/Crawlora-org/crawlora-deadweb
    cd crawlora-deadweb && go build -o crawlora-deadweb .

Prebuilt Linux / macOS / Windows binaries are published via GitHub Releases.

Usage

    crawlora-deadweb [flags] [domain...]

    $ crawlora-deadweb grooveshark.com reuters.com
    grooveshark.com
      outcome  dead
      reason   dns_failed — genuinely unreachable

    reuters.com
      outcome  blocked
      reason   forbidden (403) — alive but refusing this client
      (run with --browser for the measured browser-fingerprint arm)

--json emits NDJSON (one compact object per line) — pipe straight into jq -c or a data pipeline.

Batch / pipelines. Pass many domains as args, or pipe a list on stdin (one per line; blank lines and #-comments ignored). Domains are probed in parallel (--concurrency, default 8):

    cat domains.txt | crawlora-deadweb --json --concurrency 50 > results.ndjson
    printf 'grooveshark.com\nexample.com\n' | crawlora-deadweb
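For readers curious how the stdin handling and bounded parallelism described above might look, here is a minimal sketch. It is not taken from the repository: `parseDomainList` and `probeAll` are hypothetical names, only whole-line `#` comments are skipped (inline comments are not handled), and the probe is a stub so the example runs without network access.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"sync"
)

// parseDomainList returns one domain per line, skipping blank lines
// and lines that start with '#'.
func parseDomainList(text string) []string {
	var domains []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		domains = append(domains, line)
	}
	return domains
}

// probeAll calls probe for every domain with at most `concurrency`
// calls in flight, and returns the results in input order.
func probeAll(domains []string, concurrency int, probe func(string) string) []string {
	if concurrency < 1 {
		concurrency = 1
	}
	results := make([]string, len(domains))
	sem := make(chan struct{}, concurrency) // counting semaphore
	var wg sync.WaitGroup
	for i, d := range domains {
		wg.Add(1)
		sem <- struct{}{} // blocks while `concurrency` probes are running
		go func(i int, d string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = probe(d) // each goroutine writes its own slot
		}(i, d)
	}
	wg.Wait()
	return results
}

func main() {
	domains := parseDomainList("# seed list\ngrooveshark.com\n\nexample.com\n")
	stub := func(d string) string { return d + ": not probed (stub)" }
	for _, line := range probeAll(domains, 8, stub) {
		fmt.Println(line)
	}
}
```

A buffered channel used as a semaphore is one simple way to cap in-flight probes; a fixed pool of worker goroutines reading from a channel would work equally well.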

Each JSON record matches the open dataset schema: domain, tld, outcome, reason, first_status, final_status, final_url, scheme, hops, parked.
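As an illustration of consuming that output, the sketch below tallies records by outcome. It assumes only that `domain`, `outcome`, and `reason` are JSON strings; the other fields are left undecoded because their types are not stated here. The `record` type, the `tallyOutcomes` name, and the sample lines are made up for the example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// record decodes only the fields this example needs; unknown fields
// in each JSON object are ignored by encoding/json.
type record struct {
	Domain  string `json:"domain"`
	Outcome string `json:"outcome"`
	Reason  string `json:"reason"`
}

// tallyOutcomes counts NDJSON records per outcome label.
func tallyOutcomes(ndjson string) (map[string]int, error) {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(ndjson))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // tolerate blank lines between records
		}
		var r record
		if err := json.Unmarshal([]byte(line), &r); err != nil {
			return nil, err
		}
		counts[r.Outcome]++
	}
	return counts, sc.Err()
}

func main() {
	// Illustrative input only; real records carry the full schema.
	sample := `{"domain":"grooveshark.com","outcome":"dead","reason":"dns_failed"}
{"domain":"reuters.com","outcome":"blocked"}
`
	counts, err := tallyOutcomes(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, label := range []string{"alive", "redirect", "blocked", "dead"} {
		fmt.Printf("%-8s %d\n", label, counts[label])
	}
}
```

For large result files, reading from an `io.Reader` over the file instead of an in-memory string would be the natural next step.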

The browser arm (optional, hosted)

The local probe is a polite HTTP request from your IP, so "blocked" is an upper bound: a vendor refusing a datacenter client ≠ the site being unreachable. For the measured tier (what actually gets through with a real browser fingerprint and the proxied fleet), add --browser:

export CRAWLORA_API_KEY=... # get one at...
