go-cdc-chunkers v1.1.0: faster, leaner, and provably correct · Plakar | The Open Standard for Backup and Restore
Gilles Chehade
June 14, 2026
Gilles Chehade, CTO @ Plakar Korp, R&D software engineer
TL;DR:
go-cdc-chunkers v1.1.0 is out. It is the little library that sits at the very bottom of Plakar: the thing that decides where to cut a stream of bytes into chunks so we can deduplicate them. This release adds spec-faithful, versioned variants of the algorithms we ship (jc-v1.1.0, ultracdc-v1.0.0, fastcdc-v1.0.0), a new NewChunkerBuffer API that lets callers own the scan buffer, and a serious investment in correctness: 100% library test coverage, a fuzz target, and cross-language conformance vectors. The headline numbers: our new JC variant chunks 1 GiB of random data at 3747 MB/s, and pooling buffers across concurrent workers cuts peak RSS by roughly a third while reducing allocations by ~100×.
If you have followed Plakar for a while, you know that almost everything interesting about a deduplicating backup engine happens before we ever write a byte to a store. The question that matters is: given a stream of data, how do you split it into chunks such that, when the data changes a little, only a few chunks change?
That is the entire job of content-defined chunking (CDC), and it is the entire job of go-cdc-chunkers.
It is a small library, and it has been quietly doing its work inside Plakar for a long time. But “small” and “unimportant” are not the same thing. This is the hottest path in the whole system: every single byte you back up flows through a chunker. A few percent of throughput here, or a few hundred megabytes of RAM there, shows up directly in your backup times and your machine’s load.
So we sat down and gave it the attention it deserves.
A quick refresher on content-defined chunking
The naive way to split a file is to cut it every N bytes: fixed-size chunking. It is fast and trivial, and it is terrible for deduplication. Insert a single byte at the start of a file and every subsequent boundary shifts, so every chunk changes, and your “incremental” backup re-uploads the whole file.
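To make the boundary shift concrete, here is a minimal sketch of fixed-size chunking in Go. It is not code from go-cdc-chunkers: the helper names (`fixedChunks`, `sharedChunks`, `sampleData`) and the 4 KiB chunk size are assumptions chosen for illustration. It splits a pseudo-random buffer, inserts one byte at the front, splits again, and counts how many chunks of the new version were already present in the old one.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fixedChunks splits data into consecutive chunks of at most size bytes.
func fixedChunks(data []byte, size int) [][]byte {
	var chunks [][]byte
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		chunks = append(chunks, data[start:end])
	}
	return chunks
}

// sharedChunks counts the chunks of b whose SHA-256 digest also occurs in a,
// which is how a deduplicating store would recognise "already have this".
func sharedChunks(a, b [][]byte) int {
	seen := make(map[[32]byte]bool, len(a))
	for _, c := range a {
		seen[sha256.Sum256(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[sha256.Sum256(c)] {
			n++
		}
	}
	return n
}

// sampleData returns n deterministic pseudo-random bytes (xorshift64).
func sampleData(n int) []byte {
	out := make([]byte, n)
	x := uint64(0x9E3779B97F4A7C15)
	for i := range out {
		x ^= x << 13
		x ^= x >> 7
		x ^= x << 17
		out[i] = byte(x >> 32)
	}
	return out
}

func main() {
	original := sampleData(64 * 1024)
	// Insert a single byte at the very start of the "file".
	modified := append([]byte{0xFF}, original...)

	before := fixedChunks(original, 4096)
	after := fixedChunks(modified, 4096)
	fmt.Printf("chunks before: %d, after: %d, shared: %d\n",
		len(before), len(after), sharedChunks(before, after))
}
```

Every chunk of the modified buffer starts one byte earlier in the original content than any stored chunk does, so none of them match and the whole file would be uploaded again.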
Content-defined chunking solves this by deciding boundaries based on the content of the data rather than its offset. You slide a window over the bytes, compute a rolling hash, and cut a boundary whenever the hash satisfies some condition, for example when selected bits of the hash are all zero. Now if you insert a byte, only the chunk around the insertion changes; the boundaries after it are triggered by the same bytes as before, so everything downstream re-aligns naturally.
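The re-alignment can be observed with a toy chunker. The sketch below is an illustration under stated assumptions, not any of the algorithms shipped in go-cdc-chunkers: it uses a Gear-style hash (`h = (h << 1) + gear[b]`), a made-up cut condition (top 12 bits of the hash are zero), arbitrary minimum and maximum sizes, and a table filled by a small xorshift generator. All names (`split`, `shared`, `reassembles`, `sampleData`, `prng`) are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	minSize = 1024      // never cut a chunk shorter than this
	maxSize = 64 * 1024 // always cut once a chunk reaches this
	// Cut when the top 12 bits of the hash are zero: about once per 4096 bytes.
	cutMask = uint64(0xFFF) << 52
)

// prng is a tiny xorshift64 generator, used only to obtain deterministic
// pseudo-random table entries and sample data.
type prng uint64

func (p *prng) next() uint64 {
	x := uint64(*p)
	x ^= x << 13
	x ^= x >> 7
	x ^= x << 17
	*p = prng(x)
	return x
}

// gear maps every byte value to a fixed pseudo-random 64-bit constant.
var gear = func() (t [256]uint64) {
	p := prng(0x2545F4914F6CDD1D)
	for i := range t {
		t[i] = p.next()
	}
	return t
}()

// split cuts data wherever the rolling hash of the most recent bytes
// matches cutMask, subject to the minimum and maximum chunk sizes.
func split(data []byte) [][]byte {
	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		// Each step shifts older bytes one bit further out of the 64-bit
		// hash, so h depends on at most the last 64 bytes seen.
		h = (h << 1) + gear[b]
		n := i - start + 1
		if (n >= minSize && h&cutMask == 0) || n >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// shared counts the chunks of b whose exact content also appears in a.
func shared(a, b [][]byte) int {
	seen := make(map[string]bool, len(a))
	for _, c := range a {
		seen[string(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[string(c)] {
			n++
		}
	}
	return n
}

// reassembles reports whether concatenating chunks yields data again.
func reassembles(chunks [][]byte, data []byte) bool {
	return bytes.Equal(bytes.Join(chunks, nil), data)
}

// sampleData returns n deterministic pseudo-random bytes.
func sampleData(n int) []byte {
	p := prng(0x9E3779B97F4A7C15)
	out := make([]byte, n)
	for i := range out {
		out[i] = byte(p.next() >> 32)
	}
	return out
}

func main() {
	original := sampleData(256 * 1024)
	modified := append([]byte{0xFF}, original...) // insert one byte at the start

	before, after := split(original), split(modified)
	fmt.Println("round-trips:", reassembles(before, original), reassembles(after, modified))
	fmt.Printf("chunks after the edit: %d, already stored: %d\n",
		len(after), shared(before, after))
}
```

Because the cut decision depends only on the most recent bytes, the first boundary after the insertion normally lands on the same content as before, and from there on both versions produce identical chunks. Typically only the chunk containing the inserted byte is new.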
There is a whole family of these algorithms, each trading off speed, boundary quality, and chunk-size distribution differently:
FastCDC — the workhorse, using a Gear-based rolling hash with a normalized chunk-size distribution.
KFastCDC — a keyed variant of FastCDC that derives the Gear table from a key, so two repositories with different keys cut at different boundaries.
UltraCDC — trades a little throughput for more uniform chunk sizes and fewer, larger chunks.
JC — a more recent design that, in our benchmarks, turns out to be remarkably fast.
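As a rough illustration of the keyed idea, the sketch below derives a 256-entry Gear table from a key. The derivation shown (the first 8 bytes of HMAC-SHA256 over each byte value) is an assumption made for this example, and `keyedGearTable` is a hypothetical helper; it is not presented as how KFastCDC in go-cdc-chunkers builds its table. The point is only that the table, and therefore every cut decision that depends on it, becomes a function of the key.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// keyedGearTable derives 256 pseudo-random 64-bit constants from key.
// Assumed derivation, for illustration only: entry i is the first 8 bytes
// of HMAC-SHA256(key, byte(i)), read as a big-endian integer.
func keyedGearTable(key []byte) [256]uint64 {
	var table [256]uint64
	for i := range table {
		mac := hmac.New(sha256.New, key)
		mac.Write([]byte{byte(i)})
		table[i] = binary.BigEndian.Uint64(mac.Sum(nil)[:8])
	}
	return table
}

func main() {
	a := keyedGearTable([]byte("repository A key"))
	b := keyedGearTable([]byte("repository B key"))

	// Arrays are comparable in Go, so == compares all 256 entries.
	fmt.Println("same key, same table:", a == keyedGearTable([]byte("repository A key")))
	fmt.Println("different keys, same table:", a == b)
}
```

A table derived this way is reproducible for anyone holding the key, while someone without the key cannot predict where boundaries will fall.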
go-cdc-chunkers exposes all of them behind a single, boring interface, which is exactly what you want from a building block:
    chunker, err := chunkers.NewChunker("fastcdc", rd)
    if err != nil {
        log.Fatal(err)
    }

    for {
        chunk, err := chunker.Next()
        // Real errors are fatal; io.EOF is handled after the chunk is used.
        if err != nil && err != io.EOF {
            log.Fatal(err)
        }

        // ... use chunk ...

        if err == io.EOF {
            break
        }
    }
Swap "fastcdc" for "ultracdc", "jc", or "kfastcdc" and nothing else changes.
Spec-faithful, versioned variants
The first big theme of v1.1.0 is correctness by specification.
Over time, “FastCDC” and friends drift. Everyone’s implementation accumulates small deviations from the published papers: a slightly different mask, an off-by-one in the window, a tweaked normalization threshold. Each deviation might be defensible on its own, but the result is that “FastCDC” stops meaning one specific thing. That is a problem for a backup tool, because the chunk boundaries are part of your data’s identity: change them and you change which chunks dedup against which.
So in this release we introduced spec-faithful, explicitly versioned variants:
jc-v1.1.0 — a spec-faithful implementation of the JC algorithm.
ultracdc-v1.0.0 — a spec-faithful UltraCDC (which, along the way, fixed a short-tail panic on tiny inputs).
fastcdc-v1.0.0 — our reference FastCDC, pinned.
The point of the version suffix is that it is a contract. ultracdc-v1.0.0 will always cut bytes the same way, forever. If we ever want to improve the algorithm, the result becomes ultracdc-v1.1.0: a new name that callers opt into deliberately, so data is never silently re-chunked underneath anyone.
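One way to picture this contract is a registry in which a published name can never be rebound, paired with recorded cut offsets that each name must reproduce. The sketch below is a toy under stated assumptions: `cutter`, `every`, `register`, `lookup`, `conforms`, and the `toy-v…` names are all hypothetical and do not reflect the internals of go-cdc-chunkers or the format of its conformance vectors.

```go
package main

import "fmt"

// cutter returns the end offset of every chunk it would cut from data.
type cutter func(data []byte) []int

// every returns a toy cutter that cuts after every n bytes.
func every(n int) cutter {
	return func(data []byte) []int {
		var cuts []int
		for end := n; end < len(data); end += n {
			cuts = append(cuts, end)
		}
		if len(data) > 0 {
			cuts = append(cuts, len(data)) // the final, possibly short, chunk
		}
		return cuts
	}
}

var registry = map[string]cutter{}

// register publishes c under name and refuses to rebind an existing name,
// so behaviour behind a published name cannot change silently.
func register(name string, c cutter) error {
	if _, taken := registry[name]; taken {
		return fmt.Errorf("%q is already published and cannot be redefined", name)
	}
	registry[name] = c
	return nil
}

// lookup resolves an exact versioned name; there is no "latest" alias.
func lookup(name string) (cutter, error) {
	c, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown chunker %q", name)
	}
	return c, nil
}

// conforms reports whether the named cutter reproduces the recorded offsets.
func conforms(name string, input []byte, want []int) bool {
	c, err := lookup(name)
	if err != nil {
		return false
	}
	got := c(input)
	if len(got) != len(want) {
		return false
	}
	for i := range got {
		if got[i] != want[i] {
			return false
		}
	}
	return true
}

func init() {
	_ = register("toy-v1.0.0", every(4))
	_ = register("toy-v1.1.0", every(5)) // an "improvement" gets a new name
}

func main() {
	input := []byte("0123456789")
	fmt.Println(conforms("toy-v1.0.0", input, []int{4, 8, 10}))
	fmt.Println(conforms("toy-v1.1.0", input, []int{5, 10}))
	fmt.Println(register("toy-v1.0.0", every(8)))
}
```

Data chunked under the old name keeps deduplicating against itself, and the recorded offsets give any other implementation, in any language, something concrete to check against.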
This matters far more for a backup engine than for a...