Kafka's log compaction corrupts data. Here's how we fixed it
Apache Kafka's log compaction corrupts data. Here's how we fixed it
A detailed look at the bug we found and the compaction algorithm that solved it
by Alexey Bashtanov
June 25, 2026
In compacted topics, Apache Kafka® retains only the latest value for each key. Tombstones (records with a null value) express the deletion of a key. Once compaction has deleted all the value records for a key, Kafka waits at least delete.retention.ms and then removes the tombstone as well. This keeps the topic from bloating with tombstones for long-gone keys.

But there’s a problem (actually, there are four). In this post, we describe the bug we found and how coordinated compaction solves it in Redpanda Streaming.

How Kafka log compaction works

To understand the problem, here’s a brief explanation of how Kafka’s log compaction currently works.

Compaction also affects transaction control batches. In a transactional write, a producer first writes the data records (possibly across several partitions), then appends a COMMIT or ABORT control batch to each partition. Consumers running with isolation.level=read_committed use those markers to decide whether to deliver the transaction's records or hide them.

Control batches sit in the log like ordinary records, and in a compacted topic, compaction applies the same expiration-based rules to them: once the data they resolve has been compacted and enough time has passed, the marker can be removed as well. This allows efficient cleanup of old data and metadata: compaction removes not only data records and tombstones for deleted keys, but also the associated transaction control batches.

Tombstones and COMMIT/ABORT control batches are the only signals that their associated records were deleted, committed, or aborted, respectively. Once a tombstone or a control batch is compacted away, this information is gone.

This can lead to catastrophic consequences. Each broker compacts its own log independently, so compaction may remove a tombstone or a control batch on one replica while another still needs it. A replica that missed a tombstone or marker (for example, because it was offline) still retains the associated records.
When it rejoins, the leader no longer has the tombstone or the marker to replicate. The replicas then permanently disagree about what's in the log, and which version a consumer sees depends on which broker is the leader at read time.

The bug reproduces reliably on Kafka 3.9 through 4.2. We've found four variants, ranging from "deleted data reappears" to "aborted data is served as committed". Next, we’ll describe all four, walk through a reproducer for one of them, and explain how we closed the gap.

The root cause: compaction–replication race

When a broker falls behind or goes offline, it drops out of the ISR (in-sync replica set). Meanwhile, the remaining brokers keep accepting writes and keep compacting as usual. If a critical record (tombstone, COMMIT marker, or ABORT marker) is written while one replica is unavailable, and compaction removes it before the replica catches up, the replica never learns about it. From its point of view, the record never existed.

Kafka's safeguards are time-based. A tombstone becomes removable delete.retention.ms (default 24 hours) after it is written. For transaction control batches, cleanup happens in two steps:

1. After delete.retention.ms, the marker batch itself is replaced by an empty batch that still carries the producer ID and the COMMIT/ABORT flag in its header.
2. After producer.id.expiration.ms (also 24 hours by default, timed from the last producer activity), the empty batch may also be discarded.

A broker that's offline or lagging past these timers (due to a hardware failure, a long maintenance window, or a slow recovery) will miss both the marker and its empty-batch remnant, with no way to recover.

We observed four manifestations of this problem, depending on which metadata record is lost.
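
The tombstone case of this race can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not Kafka code: Record, Replica, fetch_from, and run are hypothetical names, the tombstone timer is measured from write time as described above, and control batches are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

HOUR_MS = 60 * 60 * 1000
DELETE_RETENTION_MS = 24 * HOUR_MS  # delete.retention.ms default cited above


@dataclass
class Record:
    offset: int
    key: str
    value: Optional[str]  # None models a tombstone
    timestamp_ms: int


@dataclass
class Replica:
    """Hypothetical stand-in for one broker's copy of a compacted partition."""
    log: List[Record] = field(default_factory=list)
    next_offset: int = 0

    def append(self, rec: Record) -> None:
        self.log.append(rec)
        self.next_offset = rec.offset + 1

    def compact(self, now_ms: int) -> None:
        # Keep only the latest record per key, and drop tombstones
        # that are older than the retention timer.
        latest = {rec.key: rec.offset for rec in self.log}
        kept = []
        for rec in self.log:
            superseded = latest[rec.key] != rec.offset
            expired_tombstone = (
                rec.value is None
                and now_ms - rec.timestamp_ms >= DELETE_RETENTION_MS
            )
            if not superseded and not expired_tombstone:
                kept.append(rec)
        self.log = kept

    def view(self) -> Dict[str, str]:
        # What a consumer would materialize by reading this replica.
        state: Dict[str, str] = {}
        for rec in self.log:
            if rec.value is None:
                state.pop(rec.key, None)
            else:
                state[rec.key] = rec.value
        return state


def fetch_from(leader: Replica, follower: Replica) -> None:
    # The follower only asks for offsets it has not seen yet.
    for rec in leader.log:
        if rec.offset >= follower.next_offset:
            follower.append(rec)
    follower.next_offset = leader.next_offset


def run(rejoin_ms: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    leader, follower = Replica(), Replica()
    for replica in (leader, follower):
        replica.append(Record(0, "K", "V", 0))
    # The follower is offline here, so only the leader gets the tombstone.
    leader.append(Record(1, "K", None, HOUR_MS))
    leader.compact(rejoin_ms)      # leader compacts on its own schedule
    fetch_from(leader, follower)   # follower rejoins and catches up
    follower.compact(rejoin_ms)
    return leader.view(), follower.view()


if __name__ == "__main__":
    print(run(2 * HOUR_MS))   # follower back within the retention window
    print(run(26 * HOUR_MS))  # follower back after the tombstone expired
```

In this model, a follower that returns after 2 hours still receives the tombstone and both views agree that K is gone. A follower that returns after 26 hours finds nothing to fetch, keeps K=V, and the two replicas stay divergent no matter how often either one compacts afterward.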
Each scenario below involves a 3-broker cluster in which Broker 2 goes offline for a prolonged period.

Issue 1: Tombstone divergence, deleted data reappears

A tombstone for key K is written while Broker 2 is down. Brokers 1 and 3 compact away both the original value and the tombstone. When Broker 2 rejoins, there is no tombstone left to replicate, so it keeps the original record. Brokers 1 and 3 consider K deleted; Broker 2 serves K=V. Which one a consumer sees depends on who the leader is.

Issue 2: Aborted-to-committed, aborted data served as committed

A producer does two transactions with the same transactional.id:

TX1: produce poison=SHOULD_NOT_SEE_THIS, then ABORT.
TX2: produce good=data, then COMMIT.

If Broker 2 misses the ABORT marker for TX1 and it's compacted away on other...