Probably Dance
I can program and like games
How Programmers Spend Their Time
by Malte Skarupke
I submitted a tiny patch to flash attention. The necessary typing for the change takes less than ten seconds, but the overall change took more than ten hours. So where does the time go?
It started when a coworker had a bug where cudnn attention would crash randomly. We looked at his unreleased changes and concluded that they couldn’t possibly cause this, so we suspected that we had a lingering bug that was exposed by making harmless changes to related code.
Step 1, a few hours: My coworker tried to figure this out just by running the code repeatedly, trying out various theories. The bug was hard to reproduce so this took hours without much progress.
Step 2, 1 hour: I thought this was a good reason to try out compute sanitizer. It would be easiest to just run it on our existing tests to see if it finds any issues without my coworker’s changes. But the tests run on another box because they require certain GPUs, which means you have to run the tests through some layers. Unfortunately compute sanitizer really wants to be in charge of the program, so we have to convince those layers to let compute sanitizer run the whole thing. It keeps on failing and we can’t figure out why, until eventually I suspect that the issue is that the tests run in a sandbox, and the sandbox is strict enough that it breaks compute sanitizer somehow. This turned out to be true and we probably wasted an hour together.
Step 3, 10 minutes: Run the tests outside of the testing framework. This is surprisingly easy, taking just five minutes. Compute sanitizer immediately finds a problem. Well, almost immediately. You have to know to turn off the pytorch caching allocator because it hides memory issues. If I hadn’t known that, I could have wasted hours more.
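For readers who want to try the same thing, here is a minimal sketch of launching a script under compute sanitizer with the caching allocator turned off. It assumes the `compute-sanitizer` binary from the CUDA toolkit is on the PATH and uses PyTorch's `PYTORCH_NO_CUDA_MEMORY_CACHING` environment variable. The post doesn't say which mechanism was actually used, so treat that as an assumption; the helper names are made up for illustration.

```python
import os
import shutil
import subprocess
import sys


def sanitizer_command(script, *script_args, tool="memcheck"):
    """Build the argv that puts compute-sanitizer in charge of the Python process."""
    return ["compute-sanitizer", "--tool", tool, sys.executable, script, *script_args]


def sanitizer_env(base_env=None):
    """Copy the environment and disable PyTorch's caching allocator.

    With caching disabled, each tensor gets its own CUDA allocation, so
    out-of-bounds accesses are no longer hidden inside one big cached block.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    return env


def run_under_sanitizer(script, *script_args):
    """Run the script under compute-sanitizer and return its exit code."""
    if shutil.which("compute-sanitizer") is None:
        raise FileNotFoundError("compute-sanitizer not found on PATH")
    result = subprocess.run(
        sanitizer_command(script, *script_args),
        env=sanitizer_env(),
        check=False,
    )
    return result.returncode
```

Calling `run_under_sanitizer("repro_attention.py")` (a hypothetical script name) would then run the repro outside of any test framework. Expect it to be much slower than a normal run.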
Step 4, 10 minutes: Investigate a theory that we had: We were padding one tensor, but not a related tensor that really feels like it should be padded, too. I try to use torch.nn.functional.pad but it doesn’t work for padding the batch-dimension. So we just use Tensor.expand and torch.cat together. This takes like ten minutes and the bug is still there. Then I notice another tensor that should also be padded, which takes seconds to try out now and finally our cudnn invocation runs cleanly through compute sanitizer. But a nearby test for flash-attention is failing in compute sanitizer.
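Padding the batch dimension with expand and cat can look roughly like this. It is a sketch, not our actual code: `pad_batch`, the fill value, and the shapes are assumptions made for illustration.

```python
import torch


def pad_batch(t: torch.Tensor, target_batch: int, fill_value=0) -> torch.Tensor:
    """Pad dimension 0 of `t` up to `target_batch` rows filled with `fill_value`."""
    missing = target_batch - t.shape[0]
    if missing < 0:
        raise ValueError("tensor already has more rows than target_batch")
    if missing == 0:
        return t
    # One filler row with the same trailing shape, dtype and device as `t`.
    filler = torch.full((1, *t.shape[1:]), fill_value, dtype=t.dtype, device=t.device)
    # expand() repeats the row as a view without copying; cat() then materializes it.
    filler = filler.expand(missing, *t.shape[1:])
    return torch.cat([t, filler], dim=0)
```

For example, `pad_batch(torch.ones(2, 3), 5)` gives a tensor of shape `(5, 3)` whose last three rows are zeros.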
Step 5, 20 minutes: The padding fix didn’t fix the original issue, so my coworker decides to look more into it on his own and I look more into why flash-attention is having issues. First check if we’re doing something obviously wrong. This takes 10 minutes and I find nothing. Then check the flash-attention code. Compute sanitizer gives me a line number and it fails on an interesting line related to running in deterministic mode. That’s not used often, so maybe that’s why the test is buggy. I tried to understand the index math in that line but that led nowhere, so instead I just grepped for where that variable even comes from, and there is a glaringly obvious use-after-free bug:
The dk_semaphore and dv_semaphore will go away at the end of the scope, but the data_ptrs will still be used and will point into memory that’s no longer valid.
Fixing this would take seconds (just default-construct the tensors outside the "if") but we’re just using flash-attention from pip, so I would have to build a new wheel to confirm the fix.
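The shape of this bug can be mimicked without a GPU. What follows is only an analogy in plain Python, not the flash-attention source: `Params`, `fill_params_buggy` and `fill_params_fixed` are hypothetical names, `array.array` stands in for a tensor, and `buffer_info()` plays the role of `data_ptr()`. It relies on CPython freeing objects as soon as their last reference goes away, and it never dereferences the stale address; it only shows that the owner is gone while the address is still stored.

```python
import array
import weakref


class Params:
    """Hypothetical stand-in for a struct of raw addresses passed to a kernel."""

    def __init__(self):
        self.semaphore_addr = None  # raw address, owns nothing
        self.owner = None           # only used by the fixed variant


def fill_params_buggy(params, deterministic):
    """Store only the raw address; the buffer dies when this scope ends."""
    if deterministic:
        semaphore = array.array("i", [0] * 4)
        params.semaphore_addr = semaphore.buffer_info()[0]
        # The weak reference lets the caller observe when the buffer is freed.
        return weakref.ref(semaphore)
    return None


def fill_params_fixed(params, deterministic):
    """Create the buffer outside the `if` and keep it alive alongside the params."""
    semaphore = array.array("i")  # empty, like a default-constructed tensor
    if deterministic:
        semaphore = array.array("i", [0] * 4)
        params.semaphore_addr = semaphore.buffer_info()[0]
    params.owner = semaphore  # the owner now lives as long as the params do
    return weakref.ref(semaphore)
```

After `ref = fill_params_buggy(p, True)`, the address in `p.semaphore_addr` is still set but `ref()` returns `None`: nothing owns that memory anymore. With `fill_params_fixed`, `ref()` returns the same object as `p.owner`, so the address stays valid for as long as `p` is around.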
Step 6, 2 hours: I decide to build this on my home computer because experience shows that it’s easier to get random source code to build on personal computers where I can freely install anything from apt-get or download random things from the Internet. I download the flash attention source but don’t actually know how to build it. "make" doesn’t do anything even though there is a Makefile. The readme says to use "python setup.py install" which immediately prints a message telling me to not run this command and to use some other thing instead which I hadn’t heard of before. But then it does the work anyway despite that message, so I stick with it. It fails with "unsupported architecture compute_120". I grep for where that comes from; somehow this thinks my PC supports newer things than it actually does. I try disabling it in setup.py, but pytorch does the same thing and I can’t modify that. So instead I try to figure out why it thinks compute_120 is supported when it actually isn’t. Oops, turns out I’m running ancient CUDA 12.0. I decide to upgrade to version 12.9 instead (I avoid 13.0 because that might have unknown compatibility issues). Now the build works, but it’s super slow. After 20 minutes I kill it and rerun it with more parallelism. This OOMs. So I try again with the original setting, which now OOMs as well. So instead I run with even lower parallelism, which makes the build even slower. I decide to call it a night. Unfortunately I can’t...