Adrian Sampson: A Fake Shell for Pangenomics
June 25, 2026
I have been working on an efficient toolkit for pangenomics, called FlatGFA.<br>Relative to other pangenomics tools like odgi, FlatGFA has only one trick:<br>a zero-copy data format.<br>The in-memory data format is identical to the on-disk format, so FlatGFA can skip all serialization and deserialization costs;<br>opening a file consists of an mmap(2).<br>For unfairly cherry-picked workloads, FlatGFA can be thousands of times faster than odgi.
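As a rough illustration of the zero-copy idea (not FlatGFA's actual file format or Rust code), here is a minimal Python sketch. It assumes a made-up file that holds nothing but a flat table of native unsigned integers; `write_table`, `read_entry`, and `lengths.bin` are hypothetical names invented for this example. The point is that reading one entry maps the file and indexes into it directly, with no parsing pass over the whole file.

```python
import mmap
import os
import tempfile
from array import array


def write_table(path, values):
    """Write the values as a flat table of native unsigned ints (hypothetical format)."""
    with open(path, "wb") as f:
        array("I", values).tofile(f)


def read_entry(path, index):
    """Read one entry by mapping the file; nothing is parsed or deserialized."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # Reinterpret the mapped bytes as unsigned ints in place.
            with memoryview(buf) as raw, raw.cast("I") as table:
                return table[index]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lengths.bin")
        write_table(path, [10, 20, 30, 40])
        return read_entry(path, 2)


print(demo())  # prints 30
```

The on-disk bytes and the in-memory view are the same bytes, which is why "opening" the file costs little more than the mapping itself.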
Now comes the hard part:<br>I want my genomicist colleagues to actually use FlatGFA.<br>I want to write an inventory of high-performance operations and let the real scientists compose them into complete workflows.
To let them do that kind of composition, there were two simple options:<br>we could either<br>(1) make a command-line interface that exposes all the operators and let the scientists write shell scripts to compose them, or<br>(2) design a Rust API and have the scientists write Rust code.<br>Neither option is very compelling:
The CLI approach limits the kind of composition you can do.<br>All intermediates need to be either files or pipes,<br>which can get awkward and surely comes with some overhead.
Because of all the data-structure tricks we play, our internal Rust API has an endearingly idiosyncratic style.<br>Even though our biologist collaborators are great Rust hackers, I can’t in good conscience say that we have a good API that they’d be happy to use.
This post is about the very silly alternative that we recently built:<br>a fake shell that pretends to offer option 1 but approximates the performance of option 2.
On Ousterhout Dichotomies
For a long time, I thought that the right way to “package” a performance-oriented library like FlatGFA might be with a standard Ousterhout dichotomy.<br>The performance-sensitive routines stay in Rust, but we build bindings to a higher-level language for composing those routines into whole workflows.<br>The result would look a lot like PyTorch: it doesn’t matter to ML engineers that Python isn’t very fast because more than 99% of the time is spent in optimized kernel routines written in C++ and CUDA.
Python is the natural choice for the “glue language” part of an Ousterhout dichotomy in the modern era.[1]<br>So we started building Python bindings for FlatGFA using the excellent PyO3 project.<br>We got the basics working reasonably well. For example, try this to see it in action:
$ curl -LO https://raw.githubusercontent.com/pangenome/odgi/refs/heads/master/test/LPA.gfa<br>$ uv run --with flatgfa python<br>>>> import flatgfa<br>>>> graph = flatgfa.parse("LPA.gfa")<br>>>> [path.name for path in graph.paths]
However, Python bindings had a few serious downsides:
Even with PyO3, the bindings are hard to write efficiently. The problem is the fundamental mismatch between Rust’s static lifetimes and Python’s dynamically managed heap. FlatGFA’s performance advantages come from eliminating copies, allocations, and pointer-chasing, and those are exactly the things that want to creep back in at the Rust/Python boundary.
We don’t get a whole-program view of the workload. Straightforward Python bindings mean that our only opportunity to go fast is within each call to the library, and we can’t do much across multiple calls. For example, the moment that the user writes a Python for loop that iterates over a FlatGFA data structure, we almost certainly lose the performance game. This is the same underlying reason that PyTorch has a separate, optional compiled mode, for example.
It turns out that our biologist collaborators aren’t exactly enamored with Python anyway! The traditional, familiar way to compose pangenomic pipelines is via the Unix shell. Personally, I have become too accustomed to Python being the default choice for approachability. Naturally, preferences among domain experts are contextual and varied.
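To make the whole-program point above concrete, here is a toy sketch in pure Python. `ToyGraph` is a hypothetical stand-in for a natively implemented data structure and is not the flatgfa API. Its counter stands in for trips across the language boundary: a user-written loop crosses once per element, while a single bulk operation crosses once and leaves the library free to do the work however it likes.

```python
class ToyGraph:
    """Hypothetical stand-in for a native graph object; not the flatgfa API."""

    def __init__(self, segment_lengths):
        self._lengths = list(segment_lengths)
        self.boundary_crossings = 0  # counts calls "into the library"

    def num_segments(self):
        self.boundary_crossings += 1
        return len(self._lengths)

    def segment_length(self, index):
        self.boundary_crossings += 1
        return self._lengths[index]

    def total_length(self):
        # One call; the library sees the entire operation at once.
        self.boundary_crossings += 1
        return sum(self._lengths)


def total_with_loop(graph):
    """Compute the same total from the glue language, one element at a time."""
    total = 0
    for i in range(graph.num_segments()):
        total += graph.segment_length(i)
    return total


looped = ToyGraph([3, 5, 8])
bulk = ToyGraph([3, 5, 8])
print(total_with_loop(looped), looped.boundary_crossings)  # prints 16 4
print(bulk.total_length(), bulk.boundary_crossings)        # prints 16 1
```

Both versions compute the same answer, but only the bulk call gives the library a chance to optimize the full operation.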
It made sense to reconsider the CLI-oriented approach that odgi and friends all use.
Reconsidering the Shell
Let’s look at an example of shell-based composition in this domain.<br>One tutorial from the odgi documentation shows how to find repetitive sequences in human chromosome 8 by composing operators from odgi itself and bedtools:
odgi depth -i chr8.pan.og -r chm13#chr8 | \<br>bedtools makewindows -b /dev/stdin -w 5000 > chm13.chr8.w5kbps.bed
odgi depth -i chr8.pan.og -b chm13.chr8.w5kbps.bed --threads 2 | \<br>bedtools sort > chr8.pan.depth.w5kbps.bed
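For a sense of what each `|` in the commands above sets up, here is a minimal Python sketch of a two-stage pipeline built by hand with the standard `subprocess` module. It does not run odgi or bedtools; the two stages are placeholder one-line Python programs (`PRODUCER` and `CONSUMER`, names invented here). The stages run as separate processes at the same time, with the first one's output streaming into the second.

```python
import subprocess
import sys

# Placeholder stages (assumptions for illustration, not real pangenomics tools).
PRODUCER = "for i in range(5): print(i * 1000)"
CONSUMER = "import sys\nfor line in sys.stdin: print(int(line) // 1000)"


def run_pipeline():
    """Equivalent in spirit to `producer | consumer` in a shell."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Drop our copy of the pipe so the producer notices if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return [int(token) for token in output.split()]


print(run_pipeline())  # prints [0, 1, 2, 3, 4]
```

The shell expresses all of this plumbing with a single character, which is part of its appeal as a glue language.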
It might seem odd to prefer shell scripting over a full-featured dynamic scripting language, but<br>shell scripts like this have some material advantages over Python:
Streaming I/O via pipes can be great for large datasets, in the right situation.
Simple pipeline parallelism is easy to express.
It’s straightforward to persist intermediate results in files.
The shell is kinda the ultimate glue language:<br>you can compose components developed separately,...