I taught a bucket to speak Git




What happens if I just point a git server at an object storage bucket?

Back when I was porting agent sandboxes to Go, I built everything on top of billy, a filesystem abstraction for Go. The whole trick of the project was teaching a Tigris bucket to act enough like a filesystem that a shell interpreter and its tools couldn’t tell the difference. Billy was the key layer that made the entire façade fall into place.

After I had gotten things working, I learned that I was using billy way outside its normal use case. It was originally made for go-git, a pure-Go implementation of git’s protocols and data formats. go-git doesn’t rely on the /usr/bin/git binary existing at all. Every method on billy’s filesystem interface exists purely because go-git needs it. This gave me a terrible idea: I already have a bucket that can quack like a filesystem, and go-git’s native language is “filesystem”.

Can this Just Work™? Let's find out.

Git was always an object store

If you strip away the porcelain, a git repository is 4 basic things:

Objects, or compressed blobs of data. Most of the objects in any individual repository are files.

Trees, or objects that map names to other objects. TL;DR: trees are folders.

Commits, or objects that point at one tree and at their parent commit. This lets you pin down which files belong to one logical change set.

Refs (branches and tags), which are tiny mutable pointers into the pile of objects.

Note: Until I started working on this, I was under the impression that git stored only the patches applied to an empty folder, and that this was how it reconstructed the history of your repository. It does not. It actually keeps track of entire files, which explains why big binary blobs fudge the tooling so much. The diff mental model works fine for using git day to day; it’s just wrong at the storage layer, which is the layer this post lives in.
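To make "compressed blobs of data" concrete, here is a minimal Go sketch of the loose-object format as I understand it: a header of the form `<type> <size>`, a NUL byte, then the content, all zlib-compressed. The function names `encodeLooseObject` and `decodeLooseObject` are made up for this illustration; they are not go-git's API, and packfiles (which store objects differently) are out of scope here.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"strconv"
)

// encodeLooseObject builds the stored form of a loose object:
// zlib("<type> <size>\x00<content>"). Hypothetical helper name.
func encodeLooseObject(objType string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := fmt.Fprintf(zw, "%s %d\x00", objType, len(content)); err != nil {
		return nil, err
	}
	if _, err := zw.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeLooseObject inflates a loose object, then splits and validates
// the header. Hypothetical helper name.
func decodeLooseObject(raw []byte) (string, []byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()

	data, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	header, body, ok := bytes.Cut(data, []byte{0})
	if !ok {
		return "", nil, fmt.Errorf("missing NUL after object header")
	}
	kind, sizeText, ok := bytes.Cut(header, []byte(" "))
	if !ok {
		return "", nil, fmt.Errorf("malformed object header %q", header)
	}
	size, err := strconv.Atoi(string(sizeText))
	if err != nil {
		return "", nil, fmt.Errorf("bad size in object header: %w", err)
	}
	if size != len(body) {
		return "", nil, fmt.Errorf("header says %d bytes, got %d", size, len(body))
	}
	return string(kind), body, nil
}

func main() {
	raw, err := encodeLooseObject("blob", []byte("# README\n"))
	if err != nil {
		panic(err)
	}
	kind, content, err := decodeLooseObject(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s, %d bytes: %q\n", kind, len(content), content)
}
```

Note that the whole file content is in there, not a patch against a previous version.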

For example, let’s say I just made a new git repository and committed a README.md to it. The tree for the .git folder looks something like this:

$ tree .git
.git
├── COMMIT_EDITMSG
├── config
├── HEAD
├── index
├── objects
│   ├── 5e
│   │   └── b8151eb669aa4467b6dea2c4bce19183cd0b41
│   ├── 6a
│   │   └── 6a8ecfcae2632152486aca3d9150ef83dedd66
│   ├── f4
│   │   └── d2487a1c6d742c8037c0296ddf80625190bd80
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── main
    └── tags

As you can see, there are three objects. One of them is the commit 5eb8151eb669aa4467b6dea2c4bce19183cd0b41, the next is the tree, and the last one is the README file. The main branch also points to that commit:
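The directory layout under objects/ follows directly from the object ID: the first two hex characters become the directory name and the remaining 38 become the file name. A small Go sketch of that mapping follows; `looseObjectKey` is a hypothetical name for this post, and it assumes 40-character SHA-1 IDs like the ones shown here.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"path"
)

// looseObjectKey returns where a loose object lives relative to the .git
// directory. The same string would work as an object storage key.
// Hypothetical helper; assumes SHA-1 (40 hex character) object IDs.
func looseObjectKey(id string) (string, error) {
	if len(id) != 40 {
		return "", fmt.Errorf("expected 40 hex characters, got %d", len(id))
	}
	if _, err := hex.DecodeString(id); err != nil {
		return "", fmt.Errorf("object ID is not valid hex: %w", err)
	}
	return path.Join("objects", id[:2], id[2:]), nil
}

func main() {
	key, err := looseObjectKey("5eb8151eb669aa4467b6dea2c4bce19183cd0b41")
	if err != nil {
		panic(err)
	}
	fmt.Println(key) // objects/5e/b8151eb669aa4467b6dea2c4bce19183cd0b41
}
```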

$ cat .git/refs/heads/main
5eb8151eb669aa4467b6dea2c4bce19183cd0b41
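Resolving a ref is little more than reading a tiny file. As a rough Go sketch: HEAD normally holds a symbolic ref of the form `ref: refs/heads/main`, and the branch file holds the commit ID. The `files` map below is a hypothetical stand-in for the .git directory (or a bucket), and `resolveRef` is a name invented for this example; real git also consults packed-refs, which this sketch ignores.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveRef follows symbolic refs ("ref: <name>") until it reaches an
// object ID. files maps paths under .git to file contents; it is a
// hypothetical stand-in for real storage.
func resolveRef(files map[string]string, name string) (string, error) {
	const maxDepth = 5 // guard against symbolic ref loops
	for depth := 0; depth < maxDepth; depth++ {
		raw, ok := files[name]
		if !ok {
			return "", fmt.Errorf("ref %q not found", name)
		}
		value := strings.TrimSpace(raw)
		if !strings.HasPrefix(value, "ref: ") {
			return value, nil // a plain object ID
		}
		name = strings.TrimPrefix(value, "ref: ")
	}
	return "", fmt.Errorf("too many levels of symbolic refs")
}

func main() {
	files := map[string]string{
		"HEAD":            "ref: refs/heads/main\n",
		"refs/heads/main": "5eb8151eb669aa4467b6dea2c4bce19183cd0b41\n",
	}
	id, err := resolveRef(files, "HEAD")
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // 5eb8151eb669aa4467b6dea2c4bce19183cd0b41
}
```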

The cool part is that half of this is content-addressed: an object’s name is the hash of its contents, so the content-addressed bits never change once they’ve been committed. Git objects are a great fit for Tigris’ internal model because they behave like append-only storage, just like the fundamental model Tigris is built upon. The things that do change often are the refs, which are updated to point to the latest commit. These are tiny files though, which means that Tigris can handle them with no effort required.
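Here is what "content-addressed" means in practice, as a short Go sketch. A blob's ID is the SHA-1 of the header `blob <size>`, a NUL byte, and the file content, so identical content always gets the same name. `blobID` is a hypothetical helper for this post, and it assumes a SHA-1 repository, which matches the 40-character IDs shown above.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobID computes the object ID git would assign to a file with this
// content in a SHA-1 repository: sha1("blob <size>\x00<content>").
// Hypothetical helper name.
func blobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content)) // header is part of the hash
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Same bytes in, same ID out, on any machine.
	fmt.Println(blobID([]byte("hello world\n")))
	// 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
}
```

That output should match what `echo 'hello world' | git hash-object --stdin` prints. Because the name is derived from the bytes, writing the same object twice is harmless, which is why this half of the repository never needs to be updated in place.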

However, when we host git repositories on a server, we end up creating single points of failure. Our git repos are hosted on single machines that can and will break. The entire implementation relies on git objects being 1:1 correlated with filesystem objects because everyone (even GitHub) shells out to the git binary to actually store files. Hosting git repos becomes one of the most stateful services in our stateless cloud-native environment.

Sure, git is decentralized in theory, but most of us have ended up using that to put our git repositories in one big store that has questionable uptime practices: GitHub. To be fair to hubbers, GitHub operates at a scale that none of us can really think about. They’ve been pushing the limits since their inception, when they had to get Engine Yard to keep building them bigger servers to handle the load. They have to do everything with a big mounted filesystem because git’s tooling gives them no other option.

A travesty of horrors beyond human comprehension

Now suppose this weirdness bothers you enough to do something about it. To build a git server without storing everything in the local filesystem, you have to speak git somehow, and the conventional options aren’t really all that great:

If you shell out to the git binary, now your “library” is the argv of the git process and your error handling is screen-scraping output. Internally, git implements its functionality with a billionty subcommands rather than exposing it all as a library. The codebase is held together by load-bearing calls to die(), which kills the process.

If you link into git’s guts with libgit, you inherit the “when things go bad, die()” behaviour and your app now suddenly starts crashing at random. This is not good for uptime.

If you try to use libgit2 (the...

