Sami Lehtinen - Fastsync: How I Doubled rsync's Speed
Sami Lehtinen
Writing fastsync: Why is copying files still so slow, and how I doubled rsync’s speed
I wrote a tool called fastsync after finally getting too frustrated with rsync’s poor local copying performance.
This post covers how the tool came together, the root cause I found, and why explicit cache management appears to be the secret sauce that standard tools miss.
Initial Phase
When I do server tasks, I always work out a mental baseline first. If I need to copy X bytes of data over a local bus, then based on the disk speeds I expect it to take N time units.
N hours later... wtf. Why is this only 50% done?
A quick iostat check revealed the ugly truth: I was getting less than 50% utilization on the disks. Thanks to rsync.
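The mental baseline above can be written down as a rough calculation. This is only a sketch under a simplifying assumption: source and destination are separate disks that can work at the same time, so the slower one sets the pace. The function names and the example numbers are hypothetical, not taken from fastsync.

```python
def expected_copy_seconds(total_bytes, read_bytes_per_s, write_bytes_per_s):
    """Ideal copy time when reads and writes fully overlap (assumption)."""
    # The slower of the two disks limits the whole transfer.
    return total_bytes / min(read_bytes_per_s, write_bytes_per_s)


def utilization(expected_seconds, actual_seconds):
    """Fraction of the ideal throughput that was actually achieved."""
    return expected_seconds / actual_seconds


# Hypothetical numbers: 1 TB to copy, 200 MB/s reads, 250 MB/s writes.
baseline = expected_copy_seconds(10**12, 200 * 10**6, 250 * 10**6)
```

If the real copy takes twice the baseline, `utilization()` returns 0.5, which is the kind of number iostat was showing.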
Astonishment
I couldn’t believe it. Surely someone has thought about copying files efficiently before. I can't be the first sysadmin who wants to push two disks to their limits. So why is it so slow?
The problem I observed in a traditional copy loop looks like this:
1. Pick a file
2. Read a chunk
3. Transfer / process
4. Write a chunk
5. Wait for sync / completion
This causes clear read/write turn-taking: while one disk works, the other waits. That blocks efficient parallel copying and leaves resources underutilized. It’s like those old single-threaded programs: read something, fetch from the network, process, write... and top shows "100% utilization" of a single thread, while the disk is at 0.1%, the network at 0.5%, and the CPU is just sitting in a wait state. We are creating an artificial bottleneck.
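For illustration, the turn-taking loop described above looks roughly like this in Python. It is a minimal sketch of the pattern, not rsync's actual code; the function name `serial_copy` and its defaults are made up for this example.

```python
import os


def serial_copy(src_path, dst_path, block_size=1024 * 1024):
    """Copy a file one block at a time, reading and writing in strict turns."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)  # step 2: nothing is being written now
            if not chunk:
                break
            dst.write(chunk)              # step 4: nothing is being read now
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())            # step 5: wait until data is on disk
    return copied
```

Each call blocks until it returns, so at any given moment at most one of the two disks has work queued by this loop.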
The Fix: First Attempt
I decided to write an alternative program in Python that decouples these steps. I built a pipeline where:
- A dedicated thread reads
- A queue transfers data
- A dedicated thread writes
All of this happens independently, with a configurable block size and a bounded transfer queue depth. I also write to a temporary file and move it into place with an atomic os.replace() at the end, so broken files never end up in the destination.
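A minimal sketch of such a pipeline, using only the standard library, could look like the following. This is not the fastsync source: the function name `pipelined_copy`, the `.fastsync-tmp` suffix, and the error handling are assumptions made for this example.

```python
import os
import queue
import threading


def pipelined_copy(src_path, dst_path, block_size=1024 * 1024, queue_depth=4):
    """Copy with a reader thread and a writer thread joined by a bounded queue."""
    blocks = queue.Queue(maxsize=queue_depth)  # bounded: reader blocks when full
    errors = []
    tmp_path = dst_path + ".fastsync-tmp"      # hypothetical temp-file name

    def reader():
        try:
            with open(src_path, "rb") as src:
                while True:
                    chunk = src.read(block_size)
                    if not chunk:
                        break
                    blocks.put(chunk)
        except Exception as exc:
            errors.append(exc)
        finally:
            blocks.put(None)                   # sentinel: no more data

    def writer():
        done = False
        try:
            with open(tmp_path, "wb") as dst:
                while True:
                    chunk = blocks.get()
                    if chunk is None:
                        done = True
                        break
                    dst.write(chunk)
                dst.flush()
                os.fsync(dst.fileno())
        except Exception as exc:
            errors.append(exc)
            while not done:                    # drain so the reader never hangs
                done = blocks.get() is None

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if errors:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)                # never leave a partial file behind
        raise errors[0]
    os.replace(tmp_path, dst_path)             # atomic move into place
```

Calling `pipelined_copy("a.bin", "b.bin")` copies with 4 blocks of 1 MiB in flight at most. File reads and writes release the GIL in CPython, so the two threads can genuinely overlap their I/O.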
Second Astonishment
I had done the full decoupling. So why was it still slow? This was confusing.
I experimented and increased the transfer queue depth massively - basically buffering the whole file in RAM (up to 1 GiB).
Boom. Now it worked as expected. 100% utilization. The disks were reading and writing concurrently.
But why? Increasing the transfer buffer to absurd sizes shouldn't be the right way to fix a pipeline. It technically worked, and the initial problem was solved, but it kept nagging me. Why did it take such a huge queue depth to keep the pipeline from stalling?
The Next Day: The Secret Sauce
The next day, based on my experience with general system optimization, I decided to tackle cache management.
I strongly suspected the stalling was related to the kernel’s delayed allocation and fsync combo. Delayed allocation gathers dirty pages in RAM, and then the filesystem or the flush daemon forces a massive wait until that data is physically committed to the disk, blocking other I/O.
So, I added explicit posix_fadvise() hints to my Python script. This is the secret sauce that rsync, as far as I can tell, is NOT doing by default for large sequential transfers:
_fadvise(src_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')
_fadvise(src_fd, 0, 0, 'POSIX_FADV_NOREUSE')
_fadvise(dst_fd, 0, 0, 'POSIX_FADV_SEQUENTIAL')
_fadvise(dst_fd, 0, 0, 'POSIX_FADV_NOREUSE')
Note: I also implemented a rolling POSIX_FADV_DONTNEED window trailing behind the read/write heads to drop page cache progressively, rather than waiting for EOF.
I tested it with the large queue. It worked.
Then I dropped the queue depth back down to something sane - like 4 blocks of 1 MiB.
It still worked. Perfectly.
This confirmed my theory. The root cause of the slowness was a lack of explicit cache management. With posix_fadvise calls and proper cache management, the kernel stops pausing the pipeline to flush massive dirty page buffers, so the copy no longer stalls behind those latency-inducing flushes. This doubled the performance, cut the transfer time in half, and shrank my required memory buffer from 1 GiB down to 8 MiB.
It sometimes feels like this goes into the category of "why my Python scripts are often faster than Java, .NET, or C++ programs". It’s not because the language is faster, it's because the program just does things smarter. I can't believe there are still improvements this big just sitting there, seemingly unnoticed by mainstream tools.
Factors and Addressing the Skeptics
When I posted about this on the Fediverse (Mastodon/Pleroma), a few people were understandably skeptical. “Rsync has been maintained by experienced system programmers for 30 years... are you sure you just beat them with fadvise hints?”
They asked excellent questions about bottlenecks, workloads, and environments. So, to be clear, "fastsync is 2x faster" applies because my environment...