Tracing an 8kb Postgres Read

shellpipe1 pts0 comments

Tracing an 8kb Postgres read Fernando Simões

Tracing an 8kb Postgres read<br>June 12, 2026A while ago I had an IOPS production incident. I wrote about what was happening: a page comes from shared buffers, the OS page cache, or the disk, and only the disk read counts as an IOPS. But I described those layers without measuring them. So I rebuilt the same pathology on a Linux box I control, where I can follow the behavior of every cache one at a time. The plan is to run the same query three times and measure each run: first with the pages in shared buffers, then with the pages only in the OS page cache, then with everything cold, when the read must come from the disk. After that, I compare the three numbers with two disks I can&rsquo;t touch: the EBS volume from the incident, and a Hetzner server I benchmarked a while ago.<br>The setup<br>A Debian box I keep at home, Postgres 17 in Docker, shared_buffers = 16MB, track_io_timing = on. The storage is a local NVMe SSD (nvme0n1, ext4). One table, deliberately bigger than the cache:<br>postgres=# SELECT pg_size_pretty(pg_relation_size('leads')), pg_relation_size('leads')/8192;<br>pg_size_pretty | ?column?<br>----------------+----------<br>115 MB | 14706<br>115 MB of heap, 14,706 pages of 8 KB each, against 16 MB of shared buffers. A full read of this table cannot fit in Postgres&rsquo; own cache. That is the point.<br>The table is modeled on the one that broke us: an id, an account_id, and a fat payload. There is only a plain B-tree index on account_id. The two worst queries in the incident did the same thing: filter by account_id with the index, then apply a second filter on a non-indexed column after fetching the rows. This is what I&rsquo;m reproducing here:<br>The pages<br>The index on account_id knows where an account&rsquo;s rows are, but it doesn&rsquo;t know what is inside them. For account 42:<br>postgres=# SELECT count(*), count(DISTINCT (ctid::text::point)[0]) AS heap_pages<br>postgres-# FROM leads WHERE account_id = 42;<br>count | heap_pages<br>-------+------------<br>500 | 500<br>500 rows - one per page - scattered accross 500 distinct heap pages. To read this account, Postgres reads the index to learn the row locations, then fetches 500 separate pages from the heap. The query I run adds a filter on payload, which the index cannot evaluate, so all 500 rows are fetched and then discarded:<br>Index Scan using idx_account_id on leads<br>Index Cond: (account_id = 42)<br>Filter: (payload ~~ '%zzzz%'::text)<br>Rows Removed by Filter: 500<br>500 pages read and zero rows returned - which means 4 MB of I/O for absolutely nothing. The plan touches 504 buffers: 500 heap pages plus 4 index pages. Now I run that same query in three cache states and read the Buffers line each time.<br>Layer 1: shared buffers<br>Let&rsquo;s run the query a second time, while the pages are still warm:<br>Index Scan using idx_account_id on leads (actual time=0.534..0.534 rows=0)<br>Buffers: shared hit=504<br>Execution Time: 0.569 ms<br>shared hit=504 - every page was already in the process memory, so there was no system call at all. pg_buffercache confirms they are resident:<br>postgres=# SELECT c.relname, count(*) FROM pg_buffercache b<br>postgres-# JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)<br>postgres-# WHERE c.relname IN ('leads','idx_account_id') GROUP BY 1;<br>relname | count<br>----------------+-------<br>leads | 500<br>idx_account_id | 5<br>0.57 ms. Good.<br>Layer 2: the OS page cache<br>Now let&rsquo;s drop the pages from shared buffers - restarting wipes it because shared buffers is process memory. This will not wipe the OS page cache, for obvious reasons. With cold shared buffers and warm OS cache:<br>Index Scan using idx_account_id on leads (actual time=2.292..2.293 rows=0)<br>Buffers: shared read=504<br>I/O Timings: shared read=2.024<br>Execution Time: 2.316 ms<br>shared read=504 now, not hit. From Postgres&rsquo; point of view these were 504 misses: it called read(2) 504 times that reached the kernel. But look at the timing - 2.024 ms for all 504 reads, about 4 microseconds per read, and no disk was touched. The kernel served every block from its own page cache.<br>This makes clear the mechanism behind the &ldquo;shared buffer miss&rdquo;, but without the actual disk read.<br>Layer 3: the disk<br>This is the layer Heroku hid from me, but on my own box I can reach it. Let&rsquo;s drop the OS page cache too, so the read has nowhere left to go but the SSD: restart Postgres to clear shared buffers, then echo 3 > /proc/sys/vm/drop_caches to clear the OS page cache, and run the same query again:<br>Index Scan using idx_account_id on leads (actual time=39.991..39.991 rows=0)<br>Buffers: shared read=504<br>I/O Timings: shared read=39.502<br>Execution Time: 40.276 ms<br>It shows the same 504 reads and same plan, but now I/O Timings is 39.5 ms - about 78 microseconds per read, twenty times slower than the page cache, because every one of those 504 reads went to the NVMe.<br>Let&rsquo;s check that number against the bare device with fio, random 8 KB reads with --direct=1 to bypass the page cache:<br>frn@debian:~$ fio...

read shared postgres cache buffers page

Related Articles