How Other Link Checkers Do Recursion

" />

How Other Link Checkers Do Recursion | Matthias Endler

After I published Five Years of Trying to Add Recursion to lychee, one reply I got was a very fair question:

If recursion is so hard, how do other link checkers do it? Plenty of them already crawl websites!

This sent me down a rabbit hole of reading the code of other link checkers. The key takeaway is: they didn’t find a clever trick we missed. They were built as crawlers from the very first commit, and I initially built lychee as a stream.

I went and read the source of the recursive checkers we list in lychee’s README: muffet (Go), LinkChecker (Python), linkinator (TypeScript), and broken-link-checker (JavaScript). This post is a teardown of how each one actually handles recursion, what it costs them, and what it means for lychee.

If you haven’t read the first post, the summary is that lychee was architected as a one-shot, unidirectional pipeline (inputs → extract → check → output). Recursion needs a cycle (responses create new inputs), and cycles in an async, channel-based pipeline are where the dragons live. 🐲 Five years and four attempts later, the pieces we’ll need to do it properly only just landed.

DAGs vs. cycles

Every recursive checker I looked at is built from the same three parts:

A mutable work queue (let’s call it “frontier”), not a fixed input stream. Discovered URLs go back into the same queue they came from.

A visited set that’s updated at enqueue time (before the request completes), so two pages discovering the same link can’t both submit it.

A primitive that answers “is everything done?”: a WaitGroup, a joinable-queue counter, an onIdle() promise, or a queue-drain event.

Diagrammatically, lychee is different from the others:

graph TD subgraph crawler["Everyone else: a cycle"] direction TB CQ[Frontier queue] --> CW[Worker pool] CW --> CP[Fetch and parse page] CP -->|new links| CQ CP --> CR[Results] end subgraph lychee["lychee: a DAG"] direction TB LA[Inputs] --> LB[Extractor] LB --> LC[Checker] LC --> LD[Results] end Crawlers have a back-edge baked in. Our pipeline doesn’t, and every one of my failed attempts was an effort to bend that back-edge into a graph that was never designed for it.

Let’s look at that graph design more closely:

graph TD Seed[Seed URLs] --> Enq["Enqueue step: is URL in visited set?"] Enq -->|yes| Skip[Drop] Enq -->|no| Mark["Mark visited, then push"] Mark --> Q[Frontier queue] Q --> Pool["Worker pool, bounded concurrency"] Pool --> FP[Fetch page and extract links] FP -->|discovered links| Enq FP --> Rec[Results] Q -.->|empty AND no worker busy| Stop[Terminate] Note that the visited check happens in the enqueue step, atomically with the mark, before the worker ever touches the network. That ordering is the entire fix to the deduplication race that haunted lychee’s attempts 1–4, where the cache was written after checking.

Each tool uses a variation on it.

muffet (Go): a WaitGroup and a Set

muffet is closest in spirit to lychee: a fast, single-binary, concurrent website checker. The dedup + scheduling decision lives in one method (page_checker.go):

func (c *pageChecker) addPage(p page) { if !c.donePages.Add(p.URL().String()) { c.daemonManager.Add(func() { c.checkPage(p) }) donePages is a concurrentStringSet (a mutex-guarded map[string]struct{}). Add returns whether the URL was already present, so a page is only scheduled the first time it’s seen. Dedup happens at enqueue, synchronized by the set’s mutex. This is basically a line-by-line translation of the diagram above.

Checking a page fetches all of its links concurrently, and feeds qualifying ones back into addPage, the back-edge:

go func(u string) { defer w.Done() status, p, err := c.fetcher.Fetch(u) // ... if !c.onePageOnly && p != nil && c.linkValidator.Validate(p.URL()) { c.addPage(p) // recursion: discovered page re-enters the frontier }(u)

How muffet knows it’s done

muffet’s answer to termination is a little daemonManager built around a sync.WaitGroup (daemon_manager.go):

func (m daemonManager) Add(f func()) { m.waitGroup.Add(1) m.daemons func() { f() m.waitGroup.Done()

func (m daemonManager) Run() { go func() { for f := range m.daemons { go f() }() m.waitGroup.Wait() // Every scheduled page increments the group; every completed page decrements it; Wait() returns when the count hits zero. The whole crawl bootstraps with a single addPage before Run(), so the counter is positive before anyone waits on it.

This is the same counter I tried (and failed with) in Attempt 1 and Attempt 4. The difference is the invariant: waitGroup.Add(1) is only ever called from inside an already-running daemon that holds the count above zero (or from the bootstrap). There is no window where the counter briefly reads zero while work is still pending. Go’s WaitGroup enforces this invariant so naturally that it doesn’t feel like distributed termination detection at all, but that’s exactly what it is. It’s the moral equivalent of the WaitGroup primitive...

How Other Link Checkers Do Recursion

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

It's Not Just X. It's Y

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy