Edgar Allan Poe found bugs in Turso



Jun 8, 2026

How we use LLM coding agents to autonomously find bugs in Turso, and why giving them creative personas makes them better database testers.

Mikaël Francoeur

This is a post about testing relational databases with LLMs. We've been doing this a lot at Turso, and we're not the only ones. LLMs are such overpowered bug-finding tools that the SQLite team has even had to start a second SQLite forum reserved for bugs.

#The Problem

Testing a relational database is challenging because of the very large input space that the SQL language allows. Think of ensuring that SELECT * FROM t returns the correct results. Easy: fill t with data, run the statement, and assert that the result matches the data. Maybe also check some edge cases, such as empty tables. Now consider SELECT * FROM t NATURAL JOIN tt. The input parameters are suddenly more numerous, leading to an explosion of combinations:

table t is empty, but not tt, and vice versa

the two tables share no/one/many columns

no/some/all rows in table t match with table tt, and vice versa

Already, testing this requires substantially more thinking. Now consider that we have to do this for all combinations of left/right joins, full outer joins, compound selects, subqueries (scalar, correlated), common table expressions (recursive), window functions, custom window definitions, (ordered-set) aggregates, etc.
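To make the growth concrete, here is a minimal sketch that enumerates only the NATURAL JOIN dimensions listed above. The dimension names and the list of join kinds are illustrative assumptions, not an exhaustive model of the input space, and some combinations are degenerate (for example, "all rows match" when one table is empty).

```python
from itertools import product

# Illustrative dimensions taken from the list above; the names are hypothetical.
dimensions = {
    "t_rows": ["empty", "non-empty"],
    "tt_rows": ["empty", "non-empty"],
    "shared_columns": ["none", "one", "many"],
    "matching_rows": ["none", "some", "all"],
}

# Every combination of one value per dimension is a distinct test scenario.
cases = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]
print(len(cases))  # 36

# Adding a single extra dimension multiplies the total again.
join_kinds = ["inner", "left", "right", "full outer"]
print(len(cases) * len(join_kinds))  # 144
```

Each additional feature (subqueries, window functions, and so on) multiplies the count further, which is why hand-written test cases cannot keep up.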

#Oracles

This is why a great deal of effort has gone into designing oracles: heuristics that determine the correctness of a wide subset of SQL. For example, here's a simple oracle:

1. take a SELECT statement, run it and remember the result

2. read the whole table, and remember it

3. convert the first SELECT into an INSERT INTO ... SELECT, and run it

4. read the table again, and assert that it contains the rows from step 2 plus the rows from step 1

The nice thing about this kind of oracle is that it works with a broad range of SELECT statements, without hand-writing expected results.
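The oracle above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not the implementation used by any particular tool. The function name `insert_select_oracle` is hypothetical, and the sketch assumes the SELECT produces as many columns as the target table has and that column affinity does not change the inserted values.

```python
import sqlite3
from collections import Counter

def insert_select_oracle(conn, table, select_sql):
    """Check that INSERT INTO <table> <select_sql> adds exactly the rows the SELECT returned."""
    selected = conn.execute(select_sql).fetchall()               # step 1
    before = conn.execute(f"SELECT * FROM {table}").fetchall()   # step 2
    conn.execute(f"INSERT INTO {table} {select_sql}")            # step 3
    after = conn.execute(f"SELECT * FROM {table}").fetchall()    # step 4
    # Compare as multisets: row order is unspecified without ORDER BY.
    return Counter(after) == Counter(before) + Counter(selected)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y"), (3, None)])

ok = insert_select_oracle(conn, "t", "SELECT a + 10, b FROM t WHERE a >= 2")
print(ok)
```

No expected result is written by hand: the same function accepts any SELECT whose shape fits the table, which is what makes the oracle reusable across generated queries.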

Tools like SQLancer or SQLRight couple oracles with pseudo-random SQL generation. SQLRight goes one step further and adds code coverage as a fitness function to orient query generation.

The downside of this is what we saw in the beginning: SQL queries are infinitely varied, and bugs tend to amass in corners of the input space where specific and unpredictable sequences or combinations of features occur. Tools that rely on random generation therefore have to run for extended, sometimes impractical, periods of time. For example, the SQLRight authors ran SQLRight for 60 days, yielding 14 logical bugs in SQLite.

#LLMs for guided pseudo-random generation

Almost a year ago, we started experimenting with LLMs for autonomous testing. Our first experiments were simple: tell ChatGPT about Turso, and ask it to come up with self-contained SQL snippets that might show bugs. This proved surprisingly effective.

Then we improved this process by using coding agents like Claude Code and Codex, and that's where things really took a turn. Claude Code and the Ralph Loop plugin proved to be unreasonably effective. We developed prompts that we could run in a loop for days at a time, and the agent would find dozens, if not hundreds, of bugs. What's more, some of them were genuinely hard to find. For example:

Panic on JOIN with empty left table, ungrouped aggregate, right-table bare column, and unindexed JOIN column (#5233)

Panic on an UPDATE with a triple self-join subquery and a window function (#5223)

These prompts tend to be a few dozen lines. You can see a full prompt here. Let's just list the important aspects of the prompt:

Give the agent a goal, e.g.: "Your goal is to identify as many bugs as possible in Turso."

Give it a fitness function. In our case, we tell it to try SQL snippets against Turso and SQLite to find bugs. Turso is SQLite-compatible, so differential testing with SQLite is a good oracle, albeit not perfect.[1] We even have a small script in the source tree that runs a statement against Turso and SQLite and says if the results match.

Give it a direction, so that it can roughly orient its autonomous exploration. In our case, we tell it to:

analyze files, focusing on areas of complexity

avoid happy paths

favour queries with unusual shapes

keep a log of things tried and re-read it after compaction
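The fitness function described above, differential testing against SQLite, can be sketched as follows. This is not the script from the Turso source tree. The helper names `run` and `results_match` are hypothetical, and to keep the example runnable with the standard library alone, the `candidate` connection is a second SQLite database standing in for the engine under test. A real setup would pass a connection to that engine instead, assuming its driver exposes a similar `execute(...).fetchall()` interface.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Run one statement and normalise the outcome for comparison."""
    try:
        rows = conn.execute(sql).fetchall()
    except Exception:  # driver-agnostic on purpose: any failure counts as "error"
        return ("error", None)
    # Compare rows as a multiset: without ORDER BY, row order is unspecified.
    return ("ok", Counter(rows))

def results_match(reference, candidate, setup, query):
    """Apply the same setup to both engines, then compare the query outcome."""
    for stmt in setup:
        reference.execute(stmt)
        candidate.execute(stmt)
    return run(reference, query) == run(candidate, query)

reference = sqlite3.connect(":memory:")
candidate = sqlite3.connect(":memory:")  # stand-in for the engine under test (assumption)

setup = [
    "CREATE TABLE t (a INTEGER, b TEXT)",
    "CREATE TABLE tt (a INTEGER, c TEXT)",
    "INSERT INTO tt VALUES (1, 'x')",
]
# Empty left table, ungrouped aggregate, bare column from the right table.
query = "SELECT count(*), c FROM t NATURAL JOIN tt"

matched = results_match(reference, candidate, setup, query)
print(matched)
```

A mismatch is a lead rather than a verdict: queries with ORDER BY need an order-sensitive comparison, and intentional differences between the two engines have to be filtered out before a report is filed.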

#Random testing as a search quest

One problem we've encountered with LLM loops is that after an extended period of time (over a day in my experience), agents tend to go around in circles, and they stop exploring new paths.

Researchers have studied a closely related failure mode in the paper "Inducing Sustained Creativity and Diversity in Large Language Models" (Luo, King, Puett and Smith 2026). They added a "priming phrase" (e.g. the phrase "Related to" plus a random noun) at the beginning of the prompt, and a "diverting token" at the end, and...
