Content-defined chunking in Go: 3.7 GB/s and a third less memory


go-cdc-chunkers v1.1.0: faster, leaner, and provably correct

Gilles Chehade

June 14, 2026


Gilles Chehade CTO @ Plakar Korp, R&D software engineer

TL;DR:

go-cdc-chunkers v1.1.0 is out. It is the little library that sits at the very bottom of Plakar: the thing that decides where to cut a stream of bytes into chunks so we can deduplicate them. This release adds spec-faithful, versioned variants of the algorithms we ship (jc-v1.1.0, ultracdc-v1.0.0, fastcdc-v1.0.0), a new NewChunkerBuffer API that lets callers own the scan buffer, and a serious investment in correctness: 100% library test coverage, a fuzz target, and cross-language conformance vectors. The headline numbers: our new JC variant chunks 1 GiB of random data at 3747 MB/s, and pooling buffers across concurrent workers cuts peak RSS by roughly a third while reducing allocations by ~100×.

If you have followed Plakar for a while, you know that almost everything interesting about a deduplicating backup engine happens before we ever write a byte to a store. The question that matters is: given a stream of data, how do you split it into chunks such that, when the data changes a little, only a few chunks change?

That is the entire job of content-defined chunking (CDC), and it is the entire job of go-cdc-chunkers.

It is a small library, and it has been quietly doing its work inside Plakar for a long time. But “small” and “unimportant” are not the same thing. This is the hottest path in the whole system: every single byte you back up flows through a chunker. A few percent of throughput here, or a few hundred megabytes of RAM there, shows up directly in your backup times and your machine’s load.

So we sat down and gave it the attention it deserves.

A quick refresher on content-defined chunking

The naive way to split a file is to cut it every N bytes: fixed-size chunking. It is fast and trivial, and it is terrible for deduplication. Insert a single byte at the start of a file and every subsequent boundary shifts, so every chunk changes, and your “incremental” backup re-uploads the whole file.
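To make the boundary-shift problem concrete, here is a minimal sketch using only the Go standard library. The helpers fixedChunks, sharedChunks, and sampleData are hypothetical names invented for this illustration and are not part of go-cdc-chunkers; the 4 KiB chunk size and 64 KiB input are arbitrary.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

// fixedChunks splits data into consecutive chunks of at most size bytes.
func fixedChunks(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// sharedChunks counts the chunks of b whose SHA-256 also appears among a.
func sharedChunks(a, b [][]byte) int {
	seen := make(map[[32]byte]bool)
	for _, c := range a {
		seen[sha256.Sum256(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[sha256.Sum256(c)] {
			n++
		}
	}
	return n
}

// sampleData returns n deterministic pseudo-random bytes (fixed seed).
func sampleData(n int) []byte {
	buf := make([]byte, n)
	rand.New(rand.NewSource(42)).Read(buf)
	return buf
}

func main() {
	original := sampleData(64 * 1024)
	// Simulate a one-byte insertion at the very start of the file.
	edited := append([]byte{0xFF}, original...)

	a := fixedChunks(original, 4096)
	b := fixedChunks(edited, 4096)
	fmt.Printf("chunks: %d before, %d after, %d shared\n",
		len(a), len(b), sharedChunks(a, b))
}
```

With this input, the edited stream shares no chunks with the original: a single inserted byte is enough to defeat deduplication for the whole file.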

Content-defined chunking solves this by deciding boundaries based on the content of the data rather than its offset. You slide a window over the bytes, compute a rolling hash, and cut a boundary whenever the hash satisfies a fixed condition (for example, when a chosen set of its bits is all zero). Now if you insert a byte, only the chunk around the insertion changes: chunks before it are untouched, and the boundaries after it re-align naturally.
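The re-alignment can be seen with a toy chunker. The sketch below is a simplified Gear-style rolling hash written for illustration only: it is not the algorithm or the table that go-cdc-chunkers ships, and the names cdcChunks, sharedChunks, sampleData, and the constants minSize, maxSize, and mask are all assumptions made for this example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	minSize = 512                // never cut before this many bytes
	maxSize = 16384              // always cut once a chunk reaches this size
	mask    = 0xFFF0000000000000 // top 12 bits: a cut about every 4 KiB past minSize
)

// gear maps each byte value to a fixed pseudo-random 64-bit number.
// The table is arbitrary (seeded PRNG), chosen only for this sketch.
var gear = func() (t [256]uint64) {
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return
}()

// cdcChunks cuts data wherever the rolling hash has its masked bits clear.
// Each shift pushes old bytes out of the 64-bit hash, so once a chunk is
// 64 bytes long the hash depends only on the last 64 bytes of content.
func cdcChunks(data []byte) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gear[b]
		n := i - start + 1
		if (n >= minSize && h&mask == 0) || n == maxSize {
			chunks = append(chunks, data[start:i+1])
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// sharedChunks counts the chunks of b whose SHA-256 also appears among a.
func sharedChunks(a, b [][]byte) int {
	seen := make(map[[32]byte]bool)
	for _, c := range a {
		seen[sha256.Sum256(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[sha256.Sum256(c)] {
			n++
		}
	}
	return n
}

// sampleData returns n deterministic pseudo-random bytes (fixed seed).
func sampleData(n int) []byte {
	buf := make([]byte, n)
	rand.New(rand.NewSource(7)).Read(buf)
	return buf
}

func main() {
	original := sampleData(1 << 20)
	edited := append([]byte{0xFF}, original...) // one byte inserted at the start

	a, b := cdcChunks(original), cdcChunks(edited)
	fmt.Printf("chunks: %d before, %d after, %d shared\n",
		len(a), len(b), sharedChunks(a, b))
}
```

Running it should report that most chunks of the edited stream are shared with the original: once both streams cut at the same content position, every later boundary is identical.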

There is a whole family of these algorithms, each trading off speed, boundary quality, and chunk-size distribution differently:

FastCDC — the workhorse, using a Gear-based rolling hash with a normalized chunk-size distribution.

KFastCDC — a keyed variant of FastCDC that derives the Gear table from a key, so two repositories with different keys cut at different boundaries.

UltraCDC — trades a little throughput for more uniform chunk sizes and fewer, larger chunks.

JC — a more recent design that, in our benchmarks, turns out to be remarkably fast.
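As an illustration of the keyed idea behind KFastCDC, the sketch below derives a 256-entry Gear table from a key. The derivation shown (the first 8 bytes of HMAC-SHA256 over each byte value) and the name deriveGearTable are assumptions made for this example; the library's actual derivation may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// deriveGearTable builds a Gear table from key.
// Hypothetical scheme: entry i is the first 8 bytes of HMAC-SHA256(key, byte(i)).
func deriveGearTable(key []byte) [256]uint64 {
	var table [256]uint64
	for i := range table {
		mac := hmac.New(sha256.New, key)
		mac.Write([]byte{byte(i)})
		table[i] = binary.LittleEndian.Uint64(mac.Sum(nil))
	}
	return table
}

func main() {
	a := deriveGearTable([]byte("repository-a"))
	b := deriveGearTable([]byte("repository-b"))

	// The same key always yields the same table; different keys do not.
	fmt.Println("same key, same table:", a == deriveGearTable([]byte("repository-a")))
	fmt.Println("different keys, same table:", a == b)
}
```

Because the rolling hash is built from table entries, two repositories holding different tables evaluate the cut condition differently on the same bytes, and so cut at different boundaries.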

go-cdc-chunkers exposes all of them behind a single, boring interface, which is exactly what you want from a building block:

    chunker, err := chunkers.NewChunker("fastcdc", rd)
    if err != nil {
        log.Fatal(err)
    }

    for {
        chunk, err := chunker.Next()
        if err != nil && err != io.EOF {
            log.Fatal(err)
        }
        // ... use chunk ...
        if err == io.EOF {
            break
        }
    }

Swap "fastcdc" for "ultracdc", "jc", or "kfastcdc" and nothing else changes.

Spec-faithful, versioned variants

The first big theme of v1.1.0 is correctness by specification.

Over time, “FastCDC” and friends drift. Everyone’s implementation accumulates small deviations from the published papers: a slightly different mask, an off-by-one in the window, a tweaked normalization threshold. Each deviation might be defensible on its own, but the result is that “FastCDC” stops meaning one specific thing. That is a problem for a backup tool, because the chunk boundaries are part of your data’s identity: change them and you change which chunks dedup against which.

So in this release we introduced spec-faithful, explicitly versioned variants:

jc-v1.1.0 — a spec-faithful implementation of the JC algorithm.

ultracdc-v1.0.0 — a spec-faithful UltraCDC (which, along the way, fixed a short-tail panic on tiny inputs).

fastcdc-v1.0.0 — our reference FastCDC, pinned.

The version suffix is a contract. ultracdc-v1.0.0 will always cut bytes the same way, forever. If we ever want to improve the algorithm, the improved version ships as ultracdc-v1.1.0: a new name that callers opt into deliberately, so data is never silently re-chunked underneath anyone.

This matters far more for a backup engine than for a...
