I taught a bucket to speak Git

I taught a bucket to speak git | Tigris Object Storage

What happens if I just point a git server at an object storage bucket?

Back when I was porting agent sandboxes to Go, I built everything on top of billy, a filesystem abstraction for Go. The whole trick of the project was teaching a Tigris bucket to act enough like a filesystem that a shell interpreter and its tools couldn’t tell the difference. Billy was the key layer that made the entire façade fall into place.

After I had gotten things working, I learned that I was using billy way outside its normal use case. It was originally made for go-git, a pure-Go implementation of git’s protocols and data formats. It doesn’t rely on the /usr/bin/git binary existing at all. Every method on billy’s filesystem interface exists purely because go-git needs it. This gave me a terrible idea: I already have a bucket that can quack like a filesystem, and go-git’s native language is “filesystem”.

Can this Just Work™? Let's find out.

Git was always an object store

If you strip away the porcelain, a git repository is four basic things:

Blobs, or compressed objects that hold file contents. Most of the objects in any individual repository are blobs.

Trees, or objects that map names to other objects. TL;DR: trees are folders.

Commits, or objects that point at one tree and at their parent commit (or commits, for a merge). This lets you pin down which files belong to one logical change set.

Refs (branches and tags), which are tiny mutable pointers into the pile of objects.

Note: Until I started working on this, I was under the impression that git stored only the patches applied to an empty folder, and that this was how it reconstructed the history of your repository. It does not. It keeps track of entire files, which explains why big binary blobs fudge the tooling so much. The diff mental model works fine for using git day to day; it’s just wrong at the storage layer, which is the layer this post lives in.
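To make the "whole files, not patches" point concrete, here is a minimal Go sketch of how object IDs come about, using only the standard library. The names hashObject, idString, treeEntry, and encodeTree are made up for illustration and are not go-git's API. The sketch assumes SHA-1 repositories and a tree that contains only regular files, so the entry sorting is simpler than what git does for nested folders.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashObject returns the raw SHA-1 object ID for a payload of the given kind.
// The hash covers a "<kind> <size>\x00" header followed by the whole payload.
func hashObject(kind string, payload []byte) []byte {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", kind, len(payload))
	h.Write(payload)
	return h.Sum(nil)
}

// idString renders a raw object ID as the 40-character hex form git prints.
func idString(id []byte) string {
	return hex.EncodeToString(id)
}

// treeEntry is a hypothetical helper type: one name inside a tree.
type treeEntry struct {
	mode string // "100644" is a regular, non-executable file
	name string
	id   []byte // raw 20-byte ID of the object the name points at
}

// encodeTree serializes entries as "<mode> <name>\x00<raw ID>", sorted by name.
// Simplification: sorting rules for sub-trees are not handled here.
func encodeTree(entries []treeEntry) []byte {
	sorted := append([]treeEntry(nil), entries...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].name < sorted[j].name })
	var buf bytes.Buffer
	for _, e := range sorted {
		fmt.Fprintf(&buf, "%s %s\x00", e.mode, e.name)
		buf.Write(e.id)
	}
	return buf.Bytes()
}

func main() {
	blob := hashObject("blob", []byte("hello\n"))
	tree := hashObject("tree", encodeTree([]treeEntry{{"100644", "README.md", blob}}))
	fmt.Println("blob:", idString(blob))
	fmt.Println("tree:", idString(tree))
}
```

For the six bytes of "hello\n", the blob ID comes out as ce013625030ba8dba906f756967f9e9ca394464a, the same value git hash-object reports for that content. The tree's ID depends on the blob's ID, so changing one byte of the file changes the blob, then the tree, then any commit that points at that tree.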

For example, let’s say I just made a new git repository and committed a README.md to it. The tree for the .git folder looks something like this:

$ tree .git
.git
├── COMMIT_EDITMSG
├── config
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── b8151eb669aa4467b6dea2c4bce19183cd0b41
│   ├── 6a
│   │   └── 6a8ecfcae2632152486aca3d9150ef83dedd66
│   ├── f4
│   │   └── d2487a1c6d742c8037c0296ddf80625190bd80
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── main
    └── tags

As you can see there are three objects. One of them is the commit 5eb8151eb669aa4467b6dea2c4bce19183cd0b41, the next is the tree, and the last one is the README file. The main branch also points to that commit:

$ cat .git/refs/heads/main
5eb8151eb669aa4467b6dea2c4bce19183cd0b41
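The paths under objects/ follow a fixed rule: the first two hex characters of the ID name the directory and the remaining 38 name the file. The file itself is the zlib-compressed header plus payload. Here is a rough Go sketch of both halves, standard library only. The function names are hypothetical, and looseObjectPath assumes it is handed a full 40-character hex ID.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"path"
)

// looseObjectPath maps a 40-character hex object ID to its path inside .git.
// Assumption: id is a complete hex ID; shorter strings are not handled.
func looseObjectPath(id string) string {
	return path.Join("objects", id[:2], id[2:])
}

// encodeLoose builds the bytes stored at that path: a zlib stream holding a
// "<kind> <size>\x00" header followed by the raw payload.
func encodeLoose(kind string, payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := fmt.Fprintf(zw, "%s %d\x00", kind, len(payload)); err != nil {
		return nil, err
	}
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeLoose inflates a loose object and splits the header from the payload.
func decodeLoose(stored []byte) (string, []byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(stored))
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	header, payload, found := bytes.Cut(raw, []byte{0})
	if !found {
		return "", nil, fmt.Errorf("loose object is missing its header terminator")
	}
	kind, _, _ := bytes.Cut(header, []byte(" "))
	return string(kind), payload, nil
}

func main() {
	fmt.Println(looseObjectPath("5eb8151eb669aa4467b6dea2c4bce19183cd0b41"))
	stored, err := encodeLoose("blob", []byte("hello\n"))
	if err != nil {
		panic(err)
	}
	kind, payload, err := decodeLoose(stored)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %q\n", kind, payload)
}
```

Nothing in this layout needs a real disk. The path is just a key and the file is just a value, which is the property the rest of this post leans on.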

The cool part is that half of this is content-addressed. The content-addressed bits never change once they’ve been committed. Git objects are a great fit for Tigris because they are append-only, just like the storage model Tigris is built on. The things that do change often are the refs, which are updated to point to the latest commit. These are tiny files though, which means that Tigris can handle them with no effort.
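That split between write-once objects and tiny mutable refs can be sketched with a plain Go map standing in for a bucket. Everything here is hypothetical: bucket, putObject, and setRef are illustration names, not the Tigris or go-git API. The payloads in main are placeholders rather than real commit encodings, and the sketch stores payloads uncompressed to stay short.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// bucket is a hypothetical in-memory stand-in for an object storage bucket:
// a flat map from key to bytes.
type bucket map[string][]byte

// putObject stores a payload under a key derived from its own content.
// Writing the same content twice lands on the same key, so an existing key
// never needs to be overwritten with different bytes.
func (b bucket) putObject(kind string, payload []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", kind, len(payload))
	h.Write(payload)
	id := hex.EncodeToString(h.Sum(nil))
	key := "objects/" + id[:2] + "/" + id[2:]
	if _, exists := b[key]; !exists {
		b[key] = append([]byte(nil), payload...)
	}
	return id
}

// setRef overwrites a branch pointer, the only kind of key in this sketch
// whose value changes after it is first written.
func (b bucket) setRef(branch, id string) {
	b["refs/heads/"+branch] = []byte(id + "\n")
}

func main() {
	b := bucket{}

	// Placeholder payloads; real commits have a structured text format.
	first := b.putObject("commit", []byte("placeholder for commit one"))
	b.setRef("main", first)

	second := b.putObject("commit", []byte("placeholder for commit two"))
	b.setRef("main", second)

	// Two immutable object keys plus one ref key that was rewritten once.
	fmt.Println(len(b), string(b["refs/heads/main"]) == second+"\n")
}
```

The object keys only ever accumulate, while the ref key is a 41-byte value that gets replaced on every push.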

However, when we host git repositories on a server, we end up creating single points of failure. Our git repos are hosted on single machines that can and will break. The entire implementation relies on git objects being 1:1 correlated with filesystem objects because everyone (even GitHub) shells out to the git binary to actually store files. Hosting git repos becomes one of the most stateful services in our stateless cloud-native environment.

Sure, git is decentralized in theory, but most of us have ended up using that to put our git repositories in one big store that has questionable uptime practices: GitHub. To be fair to hubbers, GitHub operates at a scale that none of us can really think about. They’ve been pushing the limits since their inception, when they had to get Engine Yard to keep building them bigger servers to handle the load. They have to do everything with a big mounted filesystem because git’s tooling gives them no other option.

A travesty of horrors beyond human comprehension

Now suppose this weirdness bothers you enough to do something about it. To build a git server without storing everything in the local filesystem, you have to speak git somehow, and the conventional options aren’t really all that great:

If you shell out to the git binary, now your “library” is the argv of the git process and your error handling is screen-scraping output. Internally, git implements its functionality with a billionty subcommands rather than exposing it all as a library. The codebase is held together by load-bearing calls to die(), which kills the process.

If you link into git’s guts with libgit, you inherit the “when things go bad, die()” behaviour and your app now suddenly starts crashing at random. This is not good for uptime.

If you try to use libgit2 (the...
