Same Query, Three Results: Benchmarking ParadeDB and Postgres FTS

By James Blackwood-Sewell on June 2, 2026

Most database benchmarks publish one story based on one run. The trouble is that the same query, on the same data and hardware, can produce different results depending on workload or scheduling decisions. A benchmark that picks one set of choices and stops there can still mislead, even when the run itself is fair.

We built ParadeDB Benchmarker to support a wide range of methodologies, making each iteration of a benchmark an easy lift. The runner stays the same as your workload evolves, and the same runner can be used across wildly different benchmarks.

To demonstrate, we ran one TopK full-text search query against ParadeDB and Postgres FTS across three passes, while the dataset, query shape, hardware, and backend setup stayed fixed. Pass 1 used a single hardcoded term in a closed loop, and the two backends sat within 10% of each other. Pass 2 swapped the workload for a forty-term rotation, and the throughput gap widened to ~29x. Pass 3 kept that workload but switched the execution model to a fixed-rate open loop well inside both backends' capacity, and a ~47x P99 latency gap opened.

Benchmarker

k6 is Grafana's load testing framework, usually known for front-end testing. It already handles the hard execution-engine problems for any load benchmark: virtual user scheduling, request firing, latency measurement, and ramping load. We love k6.

Benchmarker is a runner built on top of k6. It builds a custom k6 binary with our multi-backend xk6-database extension compiled in, alongside a loader CLI, dataset tooling, Docker compose profiles, and a real-time dashboard. A single k6 JavaScript script defines backends, datasets, term sources, and scenarios in one place, and run-time artifacts such as container metrics and backend configuration are captured during each run.

Two execution shapes matter for this post: closed-loop and open-loop (PlanetScale have a good primer on these in their excellent on-benchmarking post).

In a closed-loop run, each virtual user sends a query, waits for the response, and then sends the next one, so throughput is an outcome of how quickly the database answers. This is useful for asking how much work a backend can serve with a fixed amount of client concurrency. In k6, Benchmarker uses the constant-vus executor for this shape.

In an open-loop run, the runner starts queries on a schedule, such as 50 QPS, so the offered rate is fixed and latency shows up as slower completions or missed iterations. Benchmarker uses k6's constant-arrival-rate executor for this shape, with maxVUs capping how many workers can run scheduled queries at once.
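As a rough illustration of the two shapes, here is a minimal sketch of plain k6 scenario definitions plus the back-of-envelope arithmetic behind them. It is not Benchmarker's own configuration: the scenario names, the 8-VU budget, and the 60-second durations are assumptions chosen for the example, and only the 50 QPS rate echoes the text above. In a real k6 script the object would be exported as `options`.

```javascript
// Closed loop: a fixed number of virtual users, each waiting for its response
// before sending the next query. (8 VUs and 60s are illustrative assumptions.)
const closedLoop = {
  executor: "constant-vus",
  vus: 8,
  duration: "60s",
};

// Open loop: iterations start on a schedule regardless of how fast responses return.
// maxVUs caps how many scheduled queries can be in flight at once.
const openLoop = {
  executor: "constant-arrival-rate",
  rate: 50, // iterations started per timeUnit, i.e. 50 QPS
  timeUnit: "1s",
  duration: "60s",
  preAllocatedVUs: 8,
  maxVUs: 8,
};

// In a k6 script this would be written as `export const options = ...`.
const options = {
  scenarios: { closed_loop: closedLoop, open_loop: openLoop },
};

// Closed loop: throughput is an outcome, roughly VUs divided by mean latency.
function closedLoopThroughput(vus, meanLatencySeconds) {
  return vus / meanLatencySeconds;
}

// Open loop: the rate is fixed, so the average number of in-flight queries
// is roughly rate times mean latency (Little's law).
function openLoopInFlight(ratePerSecond, meanLatencySeconds) {
  return ratePerSecond * meanLatencySeconds;
}

// If the in-flight estimate exceeds maxVUs, k6 cannot start every scheduled
// iteration and reports the shortfall as dropped iterations.
function exceedsWorkerBudget(ratePerSecond, meanLatencySeconds, maxVUs) {
  return openLoopInFlight(ratePerSecond, meanLatencySeconds) > maxVUs;
}
```

Under these assumed numbers, a backend averaging 100 ms per query needs about five workers to sustain 50 QPS, which fits the budget of eight. One averaging 200 ms needs about ten, so some scheduled iterations would be dropped.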

The concrete environment is explained below in What Stayed Fixed, and the commands to reproduce are in Try It Yourself.

What Stayed Fixed

Before we get to results, it helps to name what we did not change. The query shape, dataset, hardware, and worker budget stay fixed across the passes; we only changed the term source and, in the final pass, the arrival strategy.

Both backends run in their own Docker containers, each with four cores and eight gigabytes of memory. We restart the containers¹ between passes so each database process starts cleanly, rather than carrying connection state or backend-local state from the previous run.

The data is a cut-down slice² of the Hacker News archive: one million rows in a single hn_items table, keeping only the id and text of each post.

The comparison is ParadeDB's BM25 index against Postgres' built-in tsvector datatype and a GIN index (often referred to as Postgres full-text search). These are different ranking models over different index implementations, but both are ways developers can run TopK relevance search inside Postgres.

ParadeDB indexes text directly with BM25. Postgres pre-tokenizes text into a stored generated tsv column and indexes that value with GIN³. Both are configured for English text processing, and both answer the same query shape using their native syntax:

-- ParadeDB BM25
SELECT id, text, pdb.score(id) AS score
FROM hn_items
WHERE text ||| $1
ORDER BY score DESC
LIMIT 10;

-- Postgres FTS
SELECT id, text, ts_rank(tsv, plainto_tsquery('english', $1)) AS score
FROM hn_items
WHERE tsv @@ plainto_tsquery('english', $1)
ORDER BY score DESC
LIMIT 10;

The query shape does not change between passes; only the search term bound to $1 does. In the first pass, the term is the literal string "inverted", which matches 390 documents in this dataset. In the second and third passes, it cycles through a list of forty real search terms⁴ that match anywhere from a few hundred to tens of thousands of rows.
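The difference between the two term sources can be sketched in a few lines. This is a simplified illustration, not Benchmarker's term-source API: the function names are hypothetical, and apart from "inverted" the terms in the list are placeholders rather than the forty terms used in the runs.

```javascript
// Placeholder list. Only "inverted" comes from the post; the rest are made up
// for illustration and stand in for the real forty-term rotation.
const terms = ["inverted", "database", "compiler", "startup"];

// Pass 1: every iteration binds the same hardcoded term to $1.
function fixedTerm(_iteration) {
  return "inverted";
}

// Passes 2 and 3: cycle through the list so successive iterations hit
// terms with very different match counts.
function rotatingTerm(iteration, termList) {
  return termList[iteration % termList.length];
}

// Build the parameter array for the prepared query ($1 is the only parameter).
function queryParams(iteration, useRotation) {
  const term = useRotation ? rotatingTerm(iteration, terms) : fixedTerm(iteration);
  return [term];
}
```

With a single fixed term, every iteration does the same work, so the benchmark measures one point in the workload. A rotation spreads iterations across selective and broad terms, which is what exposes differences in how each backend handles large match sets.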

With the data, indexes, and query shape fixed, we can start with the simplest credible run and then make the workload harder one step at a time.

Pass 1: The Plausible First Answer (within 10%)

In the first pass, we keep the setup simple: sixteen...
