Why We Outgrew Cloudflare D1 (And Everything We Tried Before Building Our Own Solution)
Part 1 of the D0 Series
We run our entire product on Cloudflare. Not "mostly on Cloudflare" or "Cloudflare for the edge layer" - we mean the whole thing: workers, storage, AI, queues, real-time WebSockets, and everything in between. That decision made us, arguably, one of the most demanding power users of Cloudflare's infrastructure in existence. It also meant we hit every sharp edge of every product they ship, usually long before anyone else did - and because we were so deeply reliant on the entire stack, an outage in any single Cloudflare product was an outage for us. Workers AI goes down, our AI features go down. D1 has an incident, we can't read or write any data. It didn't matter what else was healthy.
We've been on D1 since the private alpha in August 2022. That matters for the timeline of what follows - several of these problems weren't edge cases we stumbled into after scale. We hit them early, we reported them, and in some cases we watched them sit unfixed for years.
This series is the honest account of what broke, what we tried, and what we eventually had to build ourselves. This first post is about D1 - Cloudflare's SQLite-based serverless database - what its real-world limits look like under serious multi-tenant load, and every fix we reached for before we concluded that no workaround was going to hold at the scale we were heading toward.
Our D1 Architecture, Before It Became a Problem
The way we structured our data model was intentional and, at the time, the only sane approach. Every user got their own D1 database. Every tenant got one. Every dataspace (our term for a tenant's workspace and its associated data) got one too.
That's a three-tier multi-tenant setup, and it was deliberate - complete data isolation at every level, no cross-tenant query risk, clean per-user storage boundaries. User and tenant databases stayed small - maybe a few kilobytes of permissions, IDs, and lightweight profile data, heavy on reads, almost nothing on writes. Dataspace databases were the opposite: constantly written to, constantly read, growing as long as a customer kept using the product.
By the time we were deep in production, with only a few customers and pilots, we had over 421 D1 databases and counting. That number matters a lot for almost every problem that follows.
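To make the three tiers concrete, here is a minimal TypeScript sketch of how a single request maps onto separate databases. The type names, the `databasesFor` helper, and the `tier-id` naming scheme are all hypothetical, chosen for illustration; they are not our production code.

```typescript
// Hypothetical model of the three database tiers described above.
type Tier = "user" | "tenant" | "dataspace";

interface DatabaseRef {
  tier: Tier;
  ownerId: string; // the user, tenant, or dataspace ID that owns the database
}

interface RequestContext {
  userId: string;
  tenantId: string;
  dataspaceId: string;
}

// Assumed naming scheme, for illustration only.
function databaseName(ref: DatabaseRef): string {
  return `${ref.tier}-${ref.ownerId}`;
}

// One request can touch up to three separate D1 databases, one per tier.
function databasesFor(ctx: RequestContext): DatabaseRef[] {
  return [
    { tier: "user", ownerId: ctx.userId },
    { tier: "tenant", ownerId: ctx.tenantId },
    { tier: "dataspace", ownerId: ctx.dataspaceId },
  ];
}
```

The point of the sketch is the fan-out: every new user, tenant, or dataspace adds a database, which is how a handful of customers turns into hundreds of D1 instances.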
Problem 1: REST vs. Binding, and the Routing Nightmare
D1 gives you two ways to talk to a database: over REST via the Cloudflare API, or via a Worker binding. The difference in practice is enormous, and it's not well documented.
A binding connects you directly to the D1 instance. The Worker that uses it and the D1 it's bound to resolve to each other at the edge, with virtually no routing overhead. It's fast in a way that's almost unfair to compare against the alternative.
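As a rough sketch of the two access paths, the TypeScript below shows a query through a binding and the equivalent request over REST. The `D1DatabaseLike` and `D1StatementLike` interfaces are pared-down stand-ins for the binding types that Workers provide, and `queryViaBinding` and `buildRestQuery` are hypothetical helper names. The REST URL follows Cloudflare's published `/query` endpoint shape, but treat the details as assumptions and check them against the current API reference.

```typescript
// Minimal structural types covering only the subset of the D1 binding API used here.
interface D1StatementLike {
  bind(...values: unknown[]): D1StatementLike;
  all(): Promise<{ results: unknown[] }>;
}

interface D1DatabaseLike {
  prepare(sql: string): D1StatementLike;
}

// Path 1: binding. The Worker talks to the database object it was bound to.
async function queryViaBinding(
  db: D1DatabaseLike,
  sql: string,
  params: unknown[] = [],
): Promise<unknown[]> {
  const { results } = await db.prepare(sql).bind(...params).all();
  return results;
}

// Path 2: REST. Every query is a full HTTPS request to the Cloudflare API.
interface RestQueryRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildRestQuery(
  accountId: string,
  databaseId: string,
  apiToken: string,
  sql: string,
  params: unknown[] = [],
): RestQueryRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/d1/database/${databaseId}/query`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ sql, params }),
  };
}
```

In a Worker, the first path would be called as `queryViaBinding(env.DB, "SELECT ...", [id])`, where `DB` is whatever binding name is configured. The second would be sent with `fetch(req.url, req)`. The binding path needs no account ID, token, or URL at all, which is a large part of why it is so much cheaper per query.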
REST, at least early on, was a tour of Cloudflare's internal network topology. We're based in California, so our tests always ran out of LAX - the nearest PoP. A REST query from LAX would bounce north to PDX (Portland), Cloudflare's North American control plane core - where api.cloudflare.com itself runs - because every REST call had to go through there. PDX would then route back down to wherever the D1 instance happened to be running: LAX, SJC, DFW, SEA, or DEN (our own tenant is placed in WNAM). The response made the same trip in reverse. That's four hops for a single database query, and two of them are a ~600 mile (~960 km) round trip up and down the West Coast. And that's the best case. Several of our upstream services were hosted on the East Coast, which meant crossing the continent twice per query.
And for our farthest customers - for example, we serve Crypto.com's Hong Kong office, which Cloudflare routes through its APAC region - that API core bounce to SFO was a transpacific round trip on top of everything else.
By the time Crypto.com joined us as a customer, the PDX hop had been replaced by a nearer SFO hop for traffic coming from APAC.
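The hop arithmetic above can be written down as a small sketch. It only models the path as a list of PoP codes taken from the text; it assigns no latency figures, since we don't want to imply measurements we haven't shown. `earlyRestPath` and `hopCount` are hypothetical helper names.

```typescript
// PoP codes are the airport-style codes used in the text above.
type Pop = string;

// Early REST path: client PoP -> API core -> D1 location, then back the same way.
function earlyRestPath(clientPop: Pop, corePop: Pop, d1Pop: Pop): Pop[] {
  const outbound = [clientPop, corePop, d1Pop];
  // The return trip retraces the outbound path, minus the starting point (the D1 PoP).
  const inbound = [...outbound].reverse().slice(1);
  return [...outbound, ...inbound];
}

// A path of n PoPs has n - 1 hops between them.
function hopCount(path: Pop[]): number {
  return Math.max(path.length - 1, 0);
}
```

For a query issued at LAX against a database in SJC, `earlyRestPath("LAX", "PDX", "SJC")` gives LAX, PDX, SJC, PDX, LAX: four hops, two of which touch the core. The count stays at four even when the database sits in the same PoP as the client, because the detour through the core happens regardless.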
Cloudflare has shipped meaningful changes for this:
Code Orange remediation work (stemming from their Nov 2023 incident and finished May 2026): they decentralized and decoupled their API control plane core and spread it across more PoPs.
Apr 10, 2025: they announced Read Replication, which spawns copies worldwide and greatly improved the accessibility of our root lookup table, given its read-heavy use.
May 29, 2025: D1 REST requests are now handled at the closest PoP to the incoming request, so /query and /raw calls no longer have to proxy through the control plane core at all. The changelog puts the improvement at 50-500ms depending on request and database location, with the biggest gains going to databases outside the U.S. since PDX is still where control plane metadata lives. It's a real improvement. But even with that fix, REST still carries all the overhead of an HTTP server: connection setup, TLS negotiation, parsing, headers. Bindings have none of...