A Fake Shell for Pangenomics

Adrian Sampson: A Fake Shell for Pangenomics

June 25, 2026

I have been working on an efficient toolkit for pangenomics, called FlatGFA. Relative to other pangenomics tools like odgi, FlatGFA has only one trick: a zero-copy data format. The in-memory data format is identical to the on-disk format, so FlatGFA can skip all serialization and deserialization costs; opening a file consists of an mmap(2). For unfairly cherry-picked workloads, FlatGFA can be thousands of times faster than odgi.
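To make the zero-copy idea concrete, here is a minimal Python sketch. It assumes a made-up toy layout (one native unsigned int holding a count, followed by that many values); this is not FlatGFA's real format, and the function names are hypothetical. The point is only that "opening" the file is an mmap, and reads index directly into the mapped bytes without a parsing step.

```python
import array
import mmap
import os
import tempfile


def write_toy_file(path, lengths):
    """Write a toy file: one native uint holding the count, then the values.

    This layout is invented for illustration; it is not FlatGFA's format.
    """
    words = array.array("I", [len(lengths), *lengths])
    with open(path, "wb") as f:
        f.write(words.tobytes())


def total_length(path):
    """Sum the stored values by reading them straight out of the mapping."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the mapped bytes as unsigned ints. There is no
        # deserialization pass and no copy of the file contents.
        with memoryview(mm).cast("I") as words:
            count = words[0]
            return sum(words[i] for i in range(1, 1 + count))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.bin")
    write_toy_file(path, [3, 5, 8])
    print(total_length(path))
```

The trade-off in a design like this is that the on-disk layout has to be something the program can use directly, which is presumably where the unusual data-structure tricks come from.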

Now comes the hard part: I want my genomicist colleagues to actually use FlatGFA. I want to write an inventory of high-performance operations and let the real scientists compose them into complete workflows.

To let them do that kind of composition, there were two simple options: we could either (1) make a command-line interface that exposes all the operators and let the scientists write shell scripts to compose them, or (2) design a Rust API and have the scientists write Rust code. Neither option is very compelling:

The CLI approach limits the kind of composition you can do. All intermediates need to be either files or pipes, which can get awkward and surely comes with some overhead.

Our internal Rust API uses, because of all the data-structure tricks we play, an endearingly idiosyncratic style. Even though our biologist collaborators are great Rust hackers, I can't in good conscience say that we have a good API that they'd be happy to use.

This post is about the very silly alternative that we recently built: a fake shell that pretends to offer option 1 but approximates the performance of option 2.

On Ousterhout Dichotomies

For a long time, I thought that the right way to "package" a performance-oriented library like FlatGFA might be with a standard Ousterhout dichotomy. The performance-sensitive routines stay in Rust, but we build bindings to a higher-level language for composing those routines into whole workflows. The result would look a lot like PyTorch: it doesn't matter to ML engineers that Python isn't very fast because more than 99% of the time is spent in optimized kernel routines written in C++ and CUDA.

Python is the natural choice for the "glue language" part of an Ousterhout dichotomy in the modern era. So we started building Python bindings for FlatGFA using the excellent PyO3 project. We got the basics working reasonably well. For example, try this to see it in action:

    $ curl -LO https://raw.githubusercontent.com/pangenome/odgi/refs/heads/master/test/LPA.gfa
    $ uv run --with flatgfa python
    >>> import flatgfa
    >>> graph = flatgfa.parse("LPA.gfa")
    >>> [path.name for path in graph.paths]

However, Python bindings had a few serious downsides:

Even with PyO3, the bindings are hard to write efficiently. The problem is the fundamental mismatch between Rust's static lifetimes and Python's dynamically managed heap. FlatGFA's performance advantages come from eliminating copies, allocations, and pointer-chasing: all things that want to creep back in at the Rust/Python boundary.

We don't get a whole-program view of the workload. Straightforward Python bindings mean that our only opportunity to go fast is within each call to the library, and we can't do much across multiple calls. For example, the moment that the user writes a Python for loop that iterates over a FlatGFA data structure, we almost certainly lose the performance game. This is the same underlying reason that PyTorch has a separate, optional compiled mode.

It turns out that our biologist collaborators aren't exactly enamored with Python anyway! The traditional, familiar way to compose pangenomic pipelines is via the Unix shell. Personally, I have become too accustomed to Python being the default choice for approachability. Naturally, preferences among domain experts are contextual and varied.
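The whole-program-view problem in the second point can be illustrated with a toy model. `ToyGraph` below is a hypothetical stand-in for a native-backed object, not FlatGFA's actual Python API; each method call models one trip across the Python/native boundary and bumps a counter. A user-written loop crosses the boundary once per element, while a single bulk operation crosses it once, and only in the latter case can the library optimize the whole computation.

```python
class ToyGraph:
    """Hypothetical stand-in for a native-backed graph object.

    Each method call models one trip across the Python/native boundary;
    `crossings` counts them. None of these names are FlatGFA's real API.
    """

    def __init__(self, segment_lengths):
        self._lengths = list(segment_lengths)
        self.crossings = 0

    def segment_count(self):
        self.crossings += 1
        return len(self._lengths)

    def segment_length(self, index):
        self.crossings += 1
        return self._lengths[index]

    def total_length(self):
        # One call: the loop runs "inside the library".
        self.crossings += 1
        return sum(self._lengths)


def total_via_python_loop(graph):
    """User-written loop: one crossing per segment, plus one for the count."""
    total = 0
    for i in range(graph.segment_count()):
        total += graph.segment_length(i)
    return total


if __name__ == "__main__":
    looped = ToyGraph([3, 5, 8])
    print(total_via_python_loop(looped), looped.crossings)

    bulk = ToyGraph([3, 5, 8])
    print(bulk.total_length(), bulk.crossings)
```

For n segments the loop version makes n + 1 crossings and the bulk version makes one. Real bindings add per-call conversion and allocation costs on top of the call count, which this sketch does not model.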

It made sense to reconsider the CLI-oriented approach that odgi and friends all use.

Reconsidering the Shell

Let's look at an example of shell-based composition in this domain. One tutorial from the odgi documentation shows how to find repetitive sequences in human chromosome 8 by composing operators from odgi itself and bedtools:

    odgi depth -i chr8.pan.og -r chm13#chr8 | \
    bedtools makewindows -b /dev/stdin -w 5000 > chm13.chr8.w5kbps.bed

    odgi depth -i chr8.pan.og -b chm13.chr8.w5kbps.bed --threads 2 | \
    bedtools sort > chr8.pan.depth.w5kbps.bed
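For readers more at home in Python, here is roughly what the shell does for each `|` in the commands above: it starts both processes at once and connects them with a pipe, so the consumer can work while the producer is still running. This is a sketch using the standard `subprocess` module; the two stages are tiny Python one-liners standing in for real tools like odgi and bedtools, so that the example runs anywhere.

```python
import subprocess
import sys

# Stand-ins for real tools: each stage is a small Python one-liner.
PRODUCER = "for i in (3, 1, 2): print(i)"
CONSUMER = "import sys; sys.stdout.writelines(sorted(sys.stdin))"


def run_pipeline():
    """Run the equivalent of `PRODUCER | CONSUMER`; return the consumer's output."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Drop our copy of the pipe's read end so that only the consumer holds
    # it; the producer can then notice if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output


if __name__ == "__main__":
    print(run_pipeline(), end="")
```

The contrast in verbosity is part of the argument: the shell expresses the same wiring with a single character.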

It might seem odd to prefer shell scripting over a full-featured dynamic scripting language, but shell scripts like this have some material advantages over Python:

Streaming I/O via pipes can be great for large datasets, in the right situation.

Simple pipeline parallelism is easy to express.

It's straightforward to persist intermediate results in files.

The shell is kinda the ultimate glue language: you can compose components developed separately,...
