Content-defined chunking in Go: 3.7 GB/s and a third less memory


go-cdc-chunkers v1.1.0: faster, leaner, and provably correct · Plakar | The Open Standard for Backup and Restore



Gilles Chehade

June 14, 2026


Gilles Chehade, CTO @ Plakar Korp, R&D software engineer

TL;DR:

go-cdc-chunkers v1.1.0 is out. It is the little library that sits at the very bottom of Plakar: the thing that decides where to cut a stream of bytes into chunks so we can deduplicate them. This release adds spec-faithful, versioned variants of the algorithms we ship (jc-v1.1.0, ultracdc-v1.0.0, fastcdc-v1.0.0), a new NewChunkerBuffer API that lets callers own the scan buffer, and a serious investment in correctness: 100% library test coverage, a fuzz target, and cross-language conformance vectors. The headline numbers: our new JC variant chunks 1 GiB of random data at 3747 MB/s, and pooling buffers across concurrent workers cuts peak RSS by roughly a third while reducing allocations by ~100×.

If you have followed Plakar for a while, you know that almost everything interesting about a deduplicating backup engine happens before we ever write a byte to a store. The question that matters is: given a stream of data, how do you split it into chunks such that, when the data changes a little, only a few chunks change?

That is the entire job of content-defined chunking (CDC), and it is the entire job of go-cdc-chunkers.

It is a small library, and it has been quietly doing its work inside Plakar for a long time. But "small" and "unimportant" are not the same thing. This is the hottest path in the whole system: every single byte you back up flows through a chunker. A few percent of throughput here, or a few hundred megabytes of RAM there, shows up directly in your backup times and your machine's load.

So we sat down and gave it the attention it deserves.

A quick refresher on content-defined chunking

The naive way to split a file is to cut it every N bytes: fixed-size chunking. It is fast and trivial, and it is terrible for deduplication. Insert a single byte at the start of a file and every subsequent boundary shifts, so every chunk changes, and your "incremental" backup re-uploads the whole file.
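The boundary shift is easy to reproduce. The following is a minimal sketch, not code from go-cdc-chunkers: the helper names (`sampleData`, `fixedChunks`, `shared`), the 4 KiB chunk size, and the 64 KiB input are all illustrative assumptions. It chunks a buffer at fixed offsets, prepends one byte, chunks again, and counts how many chunks of the edited buffer already existed.

```go
package main

import "fmt"

// sampleData returns n reproducible pseudo-random bytes from a simple LCG.
func sampleData(n int) []byte {
	out := make([]byte, n)
	x := uint32(1)
	for i := range out {
		x = x*1664525 + 1013904223
		out[i] = byte(x >> 24)
	}
	return out
}

// fixedChunks cuts data every size bytes; the last chunk may be shorter.
func fixedChunks(data []byte, size int) [][]byte {
	var out [][]byte
	for len(data) > size {
		out = append(out, data[:size])
		data = data[size:]
	}
	if len(data) > 0 {
		out = append(out, data)
	}
	return out
}

// shared counts the chunks of b whose exact content also occurs in a,
// which is what a deduplicating store could skip uploading.
func shared(a, b [][]byte) int {
	seen := make(map[string]bool, len(a))
	for _, c := range a {
		seen[string(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[string(c)] {
			n++
		}
	}
	return n
}

func main() {
	orig := sampleData(64 * 1024)
	edited := append([]byte{0xFF}, orig...) // insert one byte at the start

	a := fixedChunks(orig, 4096)
	b := fixedChunks(edited, 4096)
	fmt.Printf("chunks before: %d, after: %d, reusable: %d\n",
		len(a), len(b), shared(a, b))
}
```

Because every chunk of the edited buffer starts one byte earlier in the content than before, none of them matches a stored chunk, even though all but one byte of the data is unchanged.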

Content-defined chunking solves this by deciding boundaries based on the content of the data rather than its offset. You slide a window over the bytes, compute a rolling hash, and cut a boundary whenever the hash satisfies some condition. Now if you insert a byte, typically only the chunk around the insertion changes: chunks before it are untouched, and later boundaries land on the same content as before, so the chunks after it re-align naturally.
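To make the re-alignment concrete, here is a toy Gear-style chunker. It is a sketch under stated assumptions, not any of the algorithms shipped in go-cdc-chunkers: the mask, the splitmix64-generated table, and the helper names are invented for illustration, and it deliberately omits the minimum and maximum chunk sizes that real chunkers enforce. The hash is `h = (h << 1) + gear[b]`, so a byte's contribution is shifted out after 64 steps and each cut decision depends only on the most recent 64 bytes.

```go
package main

import "fmt"

// mask selects 11 hash bits, so a cut is expected about every 2 KiB of
// random input. Illustrative value, not taken from any specification.
const mask = uint64(0x7FF) << 48

// prng is splitmix64, used only to build reproducible tables and data.
type prng uint64

func (p *prng) next() uint64 {
	*p += 0x9E3779B97F4A7C15
	z := uint64(*p)
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9
	z = (z ^ (z >> 27)) * 0x94D049BB133111EB
	return z ^ (z >> 31)
}

// gearTable maps each byte value to a pseudo-random 64-bit word.
func gearTable(seed uint64) [256]uint64 {
	p := prng(seed)
	var t [256]uint64
	for i := range t {
		t[i] = p.next()
	}
	return t
}

// sampleData returns n reproducible pseudo-random bytes.
func sampleData(n int, seed uint64) []byte {
	p := prng(seed)
	out := make([]byte, n)
	for i := range out {
		out[i] = byte(p.next() >> 56)
	}
	return out
}

// cdcChunks cuts after every byte where the masked rolling hash is zero.
// The hash is not reset at a cut, so boundaries depend on content only.
func cdcChunks(data []byte, gear *[256]uint64) [][]byte {
	var out [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = (h << 1) + gear[b]
		if h&mask == 0 {
			out = append(out, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		out = append(out, data[start:])
	}
	return out
}

// shared counts the chunks of b whose exact content also occurs in a.
func shared(a, b [][]byte) int {
	seen := make(map[string]bool, len(a))
	for _, c := range a {
		seen[string(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[string(c)] {
			n++
		}
	}
	return n
}

func main() {
	gear := gearTable(1)
	orig := sampleData(256*1024, 2)
	edited := append([]byte{0xFF}, orig...) // insert one byte at the start

	a, b := cdcChunks(orig, &gear), cdcChunks(edited, &gear)
	fmt.Printf("original: %d chunks, edited: %d chunks, reusable: %d\n",
		len(a), len(b), shared(a, b))
}
```

In this toy, every chunk of the edited stream that starts beyond the first 64 bytes is byte-for-byte identical to a chunk of the original, so only the chunks touching the insertion need to be stored again. Production algorithms add size limits and other refinements, which can delay re-alignment by a chunk or two but preserve the same property.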

There is a whole family of these algorithms, each trading off speed, boundary quality, and chunk-size distribution differently:

FastCDC — the workhorse, using a Gear-based rolling hash with a normalized chunk-size distribution.

KFastCDC — a keyed variant of FastCDC that derives the Gear table from a key, so two repositories with different keys cut at different boundaries.

UltraCDC — trades a little throughput for more uniform chunk sizes and fewer, larger chunks.

JC — a more recent design that, in our benchmarks, turns out to be remarkably fast.
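As an illustration of the keyed idea, the sketch below derives a 256-entry Gear table from a secret key. The derivation shown (HMAC-SHA256 of the byte value, truncated to 64 bits) and the name `gearFromKey` are assumptions chosen for clarity; the source does not say how KFastCDC actually derives its table, so this should not be read as the library's scheme.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// gearFromKey derives one 64-bit Gear entry per byte value from key.
// Hypothetical derivation: the first 8 bytes of HMAC-SHA256(key, value).
func gearFromKey(key []byte) [256]uint64 {
	var table [256]uint64
	mac := hmac.New(sha256.New, key)
	for i := range table {
		mac.Reset() // back to the keyed initial state
		mac.Write([]byte{byte(i)})
		table[i] = binary.LittleEndian.Uint64(mac.Sum(nil)[:8])
	}
	return table
}

func main() {
	a := gearFromKey([]byte("repository A"))
	b := gearFromKey([]byte("repository B"))
	fmt.Printf("gear[0] for A: %016x\n", a[0])
	fmt.Printf("gear[0] for B: %016x\n", b[0])
	fmt.Println("tables identical:", a == b)
}
```

The same key always yields the same table, so a repository chunks consistently over time, while a different key yields a different table and therefore different cut points for the same data.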

go-cdc-chunkers exposes all of them behind a single, boring interface, which is exactly what you want from a building block:

    chunker, err := chunkers.NewChunker("fastcdc", rd)
    if err != nil {
        log.Fatal(err)
    }

    for {
        chunk, err := chunker.Next()
        if err != nil && err != io.EOF {
            log.Fatal(err)
        }
        // ... use chunk ...
        if err == io.EOF {
            break
        }
    }

Swap "fastcdc" for "ultracdc", "jc", or "kfastcdc" and nothing else changes.

Spec-faithful, versioned variants

The first big theme of v1.1.0 is correctness by specification.

Over time, "FastCDC" and friends drift. Everyone's implementation accumulates small deviations from the published papers: a slightly different mask, an off-by-one in the window, a tweaked normalization threshold. Each deviation might be defensible on its own, but the result is that "FastCDC" stops meaning one specific thing. That is a problem for a backup tool, because the chunk boundaries are part of your data's identity: change them and you change which chunks dedup against which.

So in this release we introduced spec-faithful, explicitly versioned variants:

jc-v1.1.0 — a spec-faithful implementation of the JC algorithm.

ultracdc-v1.0.0 — a spec-faithful UltraCDC (which, along the way, fixed a short-tail panic on tiny inputs).

fastcdc-v1.0.0 — our reference FastCDC, pinned.

The point of the version suffix is that it is a contract. ultracdc-v1.0.0 will always cut bytes the same way, forever. If we ever want to improve the algorithm, the result becomes ultracdc-v1.1.0: a new name that callers opt into deliberately, so data is never silently re-chunked underneath anyone.

This matters far more for a backup engine than for a...
