Fastsync: I doubled rsync's local copy speed

Sami Lehtinen - Fastsync: How I Doubled rsync's Speed


Sami Lehtinen

Writing fastsync: Why is copying files still so slow, and how I doubled rsync’s speed

I wrote a tool called fastsync when I finally got too frustrated with rsync’s poor local copying performance. This post goes through the making of the tool, the root cause I found, and why proper cache management is apparently the secret sauce that standard tools seem to miss.

Initial Phase

When I do server tasks, I always calculate a mental baseline. If I need to copy X bytes of data over a local bus, based on the disk speeds, I expect it to take N time units. N hours later... wtf. Why is this only 50% done? An initial and immediate iostat analysis revealed the ugly truth: I was getting less than 50% utilization on the disks. Thanks to rsync.
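That mental baseline can be written down as a small calculation. This is only a sketch: the function names and the throughput figures in the usage note are made up for illustration and are not measurements from this post.

```python
def expected_copy_seconds(total_bytes, read_bytes_per_s, write_bytes_per_s):
    """Best-case duration when reading and writing fully overlap.

    The slower of the two disks sets the pace.
    """
    bottleneck = min(read_bytes_per_s, write_bytes_per_s)
    return total_bytes / bottleneck


def utilization(expected_seconds, actual_seconds):
    """Fraction of the theoretical throughput that was actually achieved."""
    return expected_seconds / actual_seconds
```

For example, copying 2 TB from a disk that reads 200 MB/s to one that writes 250 MB/s gives `expected_copy_seconds(2_000_000_000_000, 200_000_000, 250_000_000)`, which is 10000 seconds. If the job actually takes 20000 seconds, `utilization(10000, 20000)` is 0.5.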

Astonishment

I couldn’t believe it. I am sure someone has thought about copying files efficiently before. I can't be the first sysadmin to want to push two disks to their limits. So why is it so slow?

The observed problem in a traditional copy loop looks like this:

1. Pick a file
2. Read a chunk
3. Transfer / process
4. Write a chunk
5. Wait for sync / completion

This causes a clear read/write turn-taking behavior. It blocks efficient parallel copying, and leaves resources underutilized. It’s like those old single-threaded programs: read something, fetch from the network, process, write... and you look at top seeing "100% utilization" of a single thread, while the disk is at 0.1%, the network at 0.5%, and the CPU is just sitting in a wait state. We are creating an artificial bottleneck.
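The copy loop listed above can be sketched in a few lines of Python. This is an illustration of the turn-taking pattern only, not rsync's actual implementation; `serial_copy` and its parameters are hypothetical names.

```python
import os


def serial_copy(src_path, dst_path, block_size=1024 * 1024):
    """Copy one file with strict read-then-write turn-taking."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)  # nothing is being written during this read
            if not chunk:
                break
            dst.write(chunk)              # nothing is being read during this write
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())            # wait for completion before the next file
    return copied
```

Every step waits for the previous one, so the source and the destination are never asked to work at the same time by this loop itself.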

The Fix: First Attempt

I decided to write an alternate program in Python that decoupled these steps efficiently. I built a pipeline where:

- A dedicated thread reads
- A queue transfers data
- A dedicated thread writes

All happening independently, with a configurable block size and a bounded transfer queue depth. I also used a temporary file for writing, moving it into place via an atomic os.replace() at the end, so broken files never end up in the destination.
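A minimal sketch of such a pipeline follows. It is not fastsync's actual code: `pipelined_copy`, the `.fastsync-tmp` suffix, the sentinel value, and the error handling are all assumptions made for illustration. Only the overall shape (reader thread, bounded queue, writer thread, temporary file, os.replace()) comes from the description above.

```python
import os
import queue
import threading


def pipelined_copy(src_path, dst_path, block_size=1024 * 1024, queue_depth=4):
    """Copy a file using independent reader and writer threads."""
    blocks = queue.Queue(maxsize=queue_depth)  # bounded: reader blocks when full
    errors = []
    tmp_path = dst_path + ".fastsync-tmp"      # hypothetical temp-file name

    def reader():
        try:
            with open(src_path, "rb") as src:
                while True:
                    chunk = src.read(block_size)
                    if not chunk:
                        break
                    blocks.put(chunk)
        except Exception as exc:
            errors.append(exc)
        finally:
            blocks.put(None)                   # sentinel: no more data

    def writer():
        done = False
        try:
            with open(tmp_path, "wb") as dst:
                while True:
                    chunk = blocks.get()
                    if chunk is None:
                        done = True
                        break
                    dst.write(chunk)
                dst.flush()
                os.fsync(dst.fileno())
        except Exception as exc:
            errors.append(exc)
            # Drain the queue so the reader never blocks forever on put().
            while not done and blocks.get() is not None:
                pass

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if errors:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)                # never leave a partial file behind
        raise errors[0]
    os.replace(tmp_path, dst_path)             # atomic move into place
```

Python releases the GIL during blocking file reads and writes, so the two threads can genuinely overlap their I/O. Peak buffered data is roughly `queue_depth * block_size`, plus the block each thread is holding.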

Second Astonishment

I did the full decoupling. Why was it still slow? This was confusing. I experimented and increased the transfer queue depth massively - basically buffering the whole file in RAM (up to 1 GiB). Boom. Now it worked as expected. 100% utilization. The disks were reading and writing concurrently. But why? Increasing the transfer buffer to absurd sizes shouldn't be the right way to fix a pipeline. This technically worked, and the initial problem was solved, but it kept nagging me. Why did it require such a huge queue depth to prevent the pipeline from stalling?

The Next Day: The Secret Sauce

The next day, based on my experience with generic system optimizations, I decided to tackle cache management. I strongly suspected the stalling was related to the kernel’s delayed allocation and fsync combo. Delayed allocation gathers dirty pages in RAM, and then the filesystem or the flush daemon forces a massive wait until that data is physically committed to the disk, blocking other I/O.

So, I added explicit posix_fadvise() hints to my Python script. This is the secret sauce that rsync is NOT(?) doing by default for large sequential transfers:

    _fadvise(src_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')
    _fadvise(src_fd, 0, 0, 'POSIX_FADV_NOREUSE')
    _fadvise(dst_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')
    _fadvise(dst_fd, 0, 0, 'POSIX_FADV_NOREUSE')

Note: I also implemented a rolling POSIX_FADV_DONTNEED window trailing behind the read/write heads to drop page cache progressively, rather than waiting for EOF.

I tested it with the large queue. It worked. Then, I dropped the queue depth back down to something sane - like 4 blocks of 1 MiB. It still worked. Perfectly.

This confirmed my theory. The root cause of the slowness was a lack of explicit cache management. By calling posix_fadvise and managing the cache properly, the kernel stops pausing the pipeline to flush massive dirty page buffers. We bypassed the latency-inducing I/O operations. This doubled the performance, cut the transfer time in half, and shrank my required memory buffer from 1 GiB down to 8 MiB.

It sometimes feels like this goes into the category of "why my Python scripts are often faster than Java, .NET, or C++ programs". It’s not because the language is faster, it's because the program just does things smarter. I can't believe there are still improvements this big just sitting there, seemingly unnoticed by mainstream tools.
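The posix_fadvise() hints and the rolling POSIX_FADV_DONTNEED window described above could look roughly like this with Python's os.posix_fadvise(). This is a sketch under assumptions, not fastsync's source: the `advise` helper is a guess at what a wrapper like `_fadvise` might do, the 8 MiB window size is arbitrary, and the copy is kept single-threaded so the hints stay in focus.

```python
import os

BLOCK = 1024 * 1024
WINDOW = 8 * BLOCK  # hypothetical size of the trailing DONTNEED window


def advise(fd, offset, length, name):
    """Apply a posix_fadvise hint by constant name; False where unsupported."""
    advice = getattr(os, name, None)
    if advice is None or not hasattr(os, "posix_fadvise"):
        return False  # e.g. platforms without posix_fadvise
    try:
        os.posix_fadvise(fd, offset, length, advice)
    except OSError:
        return False  # hints are advisory, so a failure is not fatal
    return True


def copy_with_hints(src_path, dst_path, block_size=BLOCK, window=WINDOW):
    """Sequential copy that hints the kernel and drops cache as it goes."""
    copied = 0   # bytes copied so far (position of the read/write heads)
    dropped = 0  # start of the region not yet released from the page cache
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            for fd in (src_fd, dst_fd):
                advise(fd, 0, 0, "POSIX_FADV_SEQUENTIAL")  # length 0 = to EOF
                advise(fd, 0, 0, "POSIX_FADV_NOREUSE")
            while True:
                chunk = os.read(src_fd, block_size)
                if not chunk:
                    break
                written = 0
                while written < len(chunk):  # os.write may write partially
                    written += os.write(dst_fd, chunk[written:])
                copied += len(chunk)
                if copied - dropped >= window:
                    # Release the region trailing behind the heads.
                    for fd in (src_fd, dst_fd):
                        advise(fd, dropped, copied - dropped, "POSIX_FADV_DONTNEED")
                    dropped = copied
            if copied > dropped:
                for fd in (src_fd, dst_fd):
                    advise(fd, dropped, copied - dropped, "POSIX_FADV_DONTNEED")
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return copied
```

One caveat from the Linux posix_fadvise(2) man page: POSIX_FADV_DONTNEED leaves pages that have not yet been written out unaffected, so on the destination side it only releases data the kernel has already flushed. Whether fastsync forces writeback before dropping the window is not stated in this post.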

Factors and Addressing the Skeptics

When I posted about this on the Fediverse (Mastodon/Pleroma), a few people were understandably skeptical. “Rsync has been maintained by experienced system programmers for 30 years... are you sure you just beat them with fadvise hints?” They asked excellent questions about bottlenecks, workloads, and environments. So, to be clear, "fastsync is 2x faster" applies because my environment...
