I taught a bucket to speak git | Tigris Object Storage
What happens if I just point a git server at an object storage bucket?
Back when I was porting agent sandboxes to Go, I built everything on top of billy, a filesystem abstraction for Go. The whole trick of the project was teaching a Tigris bucket to act enough like a filesystem that a shell interpreter and its tools couldn’t tell the difference. Billy was the key layer that made the entire façade fall into place.
After I had gotten things working, I learned that I had been using billy way outside its normal use case. It was originally made for go-git, a pure-Go implementation of git’s protocols and data formats. It doesn’t rely on the /usr/bin/git binary existing at all. Every method on billy’s filesystem interface exists purely because go-git needs it. This gave me a terrible idea: I already have a bucket that can quack like a filesystem, and go-git’s native language is “filesystem”.
Can this Just Work™? Let's find out.
Git was always an object store
If you strip away the porcelain, a git repository is 4 basic things:
Objects, or compressed blobs of data. Most of the objects in any individual repository are files.
Trees, or objects that map names to other objects. TL;DR: trees are folders.
Commits, or objects that point at one tree and at their parent commit. This lets you pin down which files belong to one logical change set.
Refs, meaning branches and tags. They are tiny mutable pointers into the pile of objects.
note
Until I started working on this, I was under the impression that git stored only the patches applied to an empty folder, and that this was how it reconstructed the history of your repository. It does not. It actually stores entire files, which explains why big binary blobs fudge the tooling so much. The diff mental model works fine for using git day to day; it’s just wrong at the storage layer, which is the layer this post lives in.
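To make "content-addressed" concrete, here is a minimal Go sketch of how a blob's ID is derived in a SHA-1 repository: git hashes a small header (`blob <size>` plus a NUL byte) followed by the file's bytes. The function name `blobID` is made up for this illustration and is not part of go-git or billy.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobID is a hypothetical helper. It returns the ID git assigns to a blob
// in a SHA-1 repository: the hash of "blob <size>\x00" followed by the content.
func blobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Same value as: echo "hello world" | git hash-object --stdin
	fmt.Println(blobID([]byte("hello world\n")))
	// prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
}
```

Because the ID depends only on the bytes, the same file always lands under the same name, no matter who writes it or when.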
For example, let’s say I just made a new git repository and committed a README.md to it. The tree for the .git folder looks something like this:
$ tree .git
.git
├── COMMIT_EDITMSG
├── config
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── b8151eb669aa4467b6dea2c4bce19183cd0b41
│   ├── 6a
│   │   └── 6a8ecfcae2632152486aca3d9150ef83dedd66
│   ├── f4
│   │   └── d2487a1c6d742c8037c0296ddf80625190bd80
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── main
    └── tags
As you can see, there are three objects. One of them is the commit 5eb8151eb669aa4467b6dea2c4bce19183cd0b41, the next is the tree, and the last one is the README file. The main branch also points to that commit:
$ cat .git/refs/heads/main
5eb8151eb669aa4467b6dea2c4bce19183cd0b41
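The layout above follows a simple rule: the first two hex characters of an object ID become the directory name and the remaining 38 become the file name. A rough Go sketch of that mapping follows. `looseObjectPath` and `parseRef` are hypothetical helpers written for this post, and the sketch only covers loose objects and plain branch files (not packfiles, packed refs, or symbolic refs like HEAD).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// looseObjectPath is a hypothetical helper that maps a 40-character object ID
// to its loose-object location relative to the .git directory.
// It only checks the length, not that every character is valid hex.
func looseObjectPath(id string) (string, error) {
	if len(id) != 40 {
		return "", fmt.Errorf("object ID %q is not 40 characters long", id)
	}
	return path.Join("objects", id[:2], id[2:]), nil
}

// parseRef strips surrounding whitespace (including the trailing newline)
// from the contents of a branch file such as refs/heads/main.
func parseRef(contents string) string {
	return strings.TrimSpace(contents)
}

func main() {
	id := parseRef("5eb8151eb669aa4467b6dea2c4bce19183cd0b41\n")
	p, err := looseObjectPath(id)
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
	// prints objects/5e/b8151eb669aa4467b6dea2c4bce19183cd0b41
}
```

The sketch uses `path` rather than `path/filepath` so the result always has forward slashes, which is what you would want for a bucket key.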
The cool part is that half of this is content-addressed. The content-addressed bits never change once they’ve been committed. That makes git objects a great fit for Tigris, because they are append-only, just like the fundamental storage model Tigris is built upon. The things that do change often are the refs, which are updated to point to the latest commit. These are tiny files though, which means that Tigris can handle them with no effort.
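Here is a toy Go sketch of that split between write-once objects and mutable refs. Everything in it is an assumption for illustration: `Store` is an in-memory stand-in for a bucket rather than a Tigris or go-git API, and it hashes the raw bytes with plain SHA-1 instead of using git's real object encoding.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// Store is a hypothetical in-memory stand-in for a bucket.
type Store struct {
	objects map[string][]byte // write-once: the key is derived from the content
	refs    map[string]string // mutable: ref name -> object ID
}

func NewStore() *Store {
	return &Store{objects: map[string][]byte{}, refs: map[string]string{}}
}

// PutObject stores content under the hex SHA-1 of its bytes. It reports
// whether anything new was written; re-putting the same bytes is a no-op.
func (s *Store) PutObject(content []byte) (id string, created bool) {
	sum := sha1.Sum(content)
	id = hex.EncodeToString(sum[:])
	if _, exists := s.objects[id]; exists {
		return id, false
	}
	s.objects[id] = append([]byte(nil), content...)
	return id, true
}

// SetRef overwrites a ref. This is the only place existing data changes.
func (s *Store) SetRef(name, id string) {
	s.refs[name] = id
}

func main() {
	s := NewStore()
	first, _ := s.PutObject([]byte("first commit"))
	second, _ := s.PutObject([]byte("second commit"))
	_, created := s.PutObject([]byte("first commit"))

	s.SetRef("refs/heads/main", first)
	s.SetRef("refs/heads/main", second)

	fmt.Println(created, len(s.objects), s.refs["refs/heads/main"] == second)
	// prints false 2 true
}
```

The object map only ever grows, while the ref map is a handful of small values that get overwritten in place.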
However, when we host git repositories on a server, we end up creating single points of failure. Our git repos are hosted on single machines that can and will break. The entire implementation relies on git objects being 1:1 correlated with filesystem objects because everyone (even GitHub) shells out to the git binary to actually store files. Hosting git repos becomes one of the most stateful services in our stateless cloud-native environment.
Sure, git is decentralized in theory, but most of us have ended up using that to put our git repositories in one big store that has questionable uptime practices: GitHub. To be fair to hubbers, GitHub operates at a scale that none of us can really think about. They’ve been pushing the limits since their inception, when they had to get Engine Yard to keep building them bigger servers to handle the load. They have to do everything with a big mounted filesystem because git’s tooling gives them no other option.
A travesty of horrors beyond human comprehension
Now suppose this weirdness bothers you enough to do something about it. To build a git server without storing everything in the local filesystem, you have to speak git somehow, and the conventional options aren’t really all that great:
If you shell out to the git binary, now your “library” is the argv of the git process and your error handling is screen-scraping output. Internally, git implements its functionality with a billionty subcommands rather than exposing it all as a library. The codebase is held together by load-bearing calls to die(), which kills the process.
If you link into git’s guts with libgit, you inherit the “when things go bad, die()” behaviour and your app now suddenly starts crashing at random. This is not good for uptime.
If you try to use libgit2 (the...