Tracing Codex's 640TB-a-year SQLite writes

Query Doctor | Tracing Codex's 640TB/year SQLite Writes Try IndeX-Ray free

← All posts Tracing Codex's 640TB/year SQLite Writes Instrumenting the Codex binary at runtime to see what queries it's making to the DB X Xetera Jun 25, 2026 · 9 min read

A recent Github issue for OpenAI's Codex on how the harness writes way way too many logs in SQLite started gaining traction the other day, and I wanted to take a shot at figuring out what's going on with it. The issue already outlines most of the problem, but I'm curious what's doing the writes.

There are a lot of tools out there that can help diagnose problems like these by logging or keeping track of queries made to databases like Mysql or Postgres. Postgres has a pg_stat_statements extension you can enable to keep track of what queries are being made, for example. But there's really no equivalent in SQLite. There's no extension for logging queries that I'm aware of, and there's no connection to "proxy" and read incoming queries, all queries are made directly on disk. This problem is already notoriously difficult for many databases, but SQLite is the worst version of it I can think of.

To get around this, I've been working on sqlite-trace, an eBPF program for logging queries made by any unknown binary that links either dynamically or statically against SQLite. Now in this specific case, yes... Codex is open source, and we have a good idea why this logging behavior exists. But there are plenty of closed-source programs that use this database, and sometimes you need runtime instrumentation to get a better idea of what's happening regardless. And currently, that's really difficult to do from the outside.

How do you log SQLite queries?

Let me show how it works first with fossil, a version control system made exclusively for SQLite, and dogfoods the db to store version control data.

xetera@lima-ebpf sudo ./build/sqlite_trace --lib $(which fossil) ... ... I run `fossil new a` in a different folder ... omitted extra queries ... ... INSERT OR IGNORE INTO user(login, info) VALUES('xetera','') fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101(DONE) t=701.9us in: sql=59B bound=0B total=59B (vars=0 scanned=0) app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns_pid=56296 cgroup=0x1746 cmd="fossil new a"

UPDATE user SET cap='s', pw='TGHWVvgdDy' WHERE login='xetera' fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101(DONE) t=5.2us in: sql=61B bound=0B total=61B (vars=0 scanned=0) app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns_pid=56296 cgroup=0x1746 cmd="fossil new a"

INSERT OR IGNORE INTO user(login,pw,cap,info) VALUES('anonymous',hex(randomblob(8)),'hz','Anon'); fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101(DONE) t=4.6us in: sql=99B bound=0B total=99B (vars=0 scanned=0) app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns_pid=56296 cgroup=0x1746 cmd="fossil new a It's able to pick up queries the binary makes against the target database, neat.

Normally, this functionality is easily accomplished by attaching uprobes to functions applications call from libraries like libsqlite3. The tricky part is, when installed from apt, fossil doesn't link against it at all.

xetera@lima-ebpf ldd $(which fossil) linux-vdso.so.1 (0x0000f0d6c613c000) libresolv.so.2 => /lib/aarch64-linux-gnu/libresolv.so.2 (0x0000f0d6c5b50000) libssl.so.3 => /lib/aarch64-linux-gnu/libssl.so.3 (0x0000f0d6c5a20000) libcrypto.so.3 => /lib/aarch64-linux-gnu/libcrypto.so.3 (0x0000f0d6c5410000) libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000f0d6c53d0000) libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000f0d6c5300000) libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000f0d6c5100000) /lib/ld-linux-aarch64.so.1 (0x0000f0d6c6100000) libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000f0d6c5030000) Which is a huge pain, because it would be so much nicer if we could just hook into a function like sqlite3_step that produces row outputs in the shared library it targets. And this is how some programs work like with the sqlite3 binary itself. Sadly, in just as many other cases, we have to deal with static linking. To make matters worse, fossil and most other production binaries you would be interested in logging strip all their symbols:

xetera@lima-ebpf nm $(which fossil) nm: /usr/bin/fossil: no symbols xetera@lima-ebpf nm -D $(which fossil) | grep sqlite xetera@lima-ebpf To attempt to solve all this, the program looks for well-known strings that can't be optimized out like cannot commit transaction - SQL statements in progress and abort due to ROLLBACK and follows the chain of calls that reference it down to a plausible looking location.

Also worth keeping in mind that SQLite's public API looks something like this:

typedef struct sqlite3_stmt sqlite3_stmt;

int sqlite3_step(sqlite3_stmt*); // other public functions... Where the sqlite3_stmt is an opaque struct. Which means even if you manage to hook the...

Tracing Codex's 640TB-a-year SQLite writes

Related Articles

(no title)

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

The labor share of income in the US is at its lowest post-war level