GitHub Handles Git LFS


How GitHub handles Git LFS - Scott Berrevoets

Snapshot tests have become an increasingly popular tool for mobile teams to verify UI changes, but one question all teams run into is: where do we store hundreds if not thousands of PNG files? Git Large File Storage (LFS) is one of the more common ways to do this for various reasons [1].

At Speak we recently went down this same path and in doing so learned a few things about how Git LFS works and how GitHub handles it. Initially we just learned the basic premise: instead of checking in a file, check in a small human-readable pointer file that identifies the large file:

    version https://git-lfs.github.com/spec/v1
    oid sha256:907b5c652cce59e009e3c7fb2dc92d3bf598251315bed741aee037f2046bd32e
    size 251222

The oid is the object ID of the file, a SHA-256 hash of its contents, and Git LFS uses it to fetch the file from the configured LFS server. Use git lfs track "./Snapshots/*.png" to specify that snapshots should be stored in LFS (the quotes keep the shell from expanding the glob itself), then commit .gitattributes to preserve that setting.
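To make the pointer format concrete, here is a minimal sketch that builds the same three lines by hand for any file. The function name make_pointer is hypothetical, and the sketch assumes GNU sha256sum is available (on macOS, shasum -a 256 would be the usual substitute).

```shell
# Sketch only: make_pointer is a hypothetical helper, not part of Git LFS.
# Prints a Git LFS pointer for the file passed as $1.
make_pointer() {
  file="$1"
  # The oid is the SHA-256 hash of the file's contents.
  oid="$(sha256sum "$file" | cut -d ' ' -f 1)"
  # The size is the file's length in bytes (tr strips padding some wc builds add).
  size="$(wc -c < "$file" | tr -d ' ')"
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
}
```

Running make_pointer on a snapshot should print the same text that ends up checked in for it. Git LFS itself ships git lfs pointer --file=<path> for this purpose, which is the better choice outside of a demonstration.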

But soon we started getting warnings from GitHub that we were exceeding our Git LFS budget. In investigating, we learned five more things that formed our mental model around how to think about Git LFS when using GitHub:

GitHub charges for storage and bandwidth: similar to S3, both storage and bandwidth cost money. Our snapshots don't take up much storage, but CI clones the iOS and Android repos for every build, and every clone downloads all the PNGs, incurring bandwidth usage.

You can skip downloading certain LFS objects to avoid bandwidth usage: when the large files aren't needed, you can tell Git LFS to leave them out by setting lfs.fetchexclude at clone time with git clone --config lfs.fetchexclude="./Snapshots/*.png". This saves bandwidth, clone time, and local storage. To download them later, run git lfs pull --include="./Snapshots/*.png".

Removing a tracked file only removes it from git, not from storage: deleting the checked-in pointer file and committing that change does nothing to the actual object in storage. It leaves a dangling object that still takes up storage space.

Fully removing references requires rewriting history: removing a file from git isn't sufficient because older commits may still reference the LFS object, so deleting the object would break checkouts of those commits. Fully purging it requires rewriting the git history using git filter-repo, which is obviously not ideal but arguably still better than irreversibly not being able to check out certain commits in the repo ever again.

Rewriting history doesn't delete the file from storage: even after all references to a file are deleted, the LFS object itself still needs to be deleted for storage to be freed up. Other than contacting support, the only way to do this is to delete the repo (and all its associated data) and recreate it.
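The bandwidth-saving step from the second point can be sketched as a pair of small shell functions, for example in a CI script. The function names and their arguments are hypothetical; the pattern and the git commands are the ones quoted above.

```shell
# Sketch only: function names and arguments are hypothetical.
# The pattern is the one used in this article.
SNAPSHOT_PATTERN='./Snapshots/*.png'

# Clone a repo while telling Git LFS not to download matching objects.
# $1 = repository URL, $2 = target directory.
clone_without_snapshots() {
  repo_url="$1"
  target_dir="$2"
  git clone --config "lfs.fetchexclude=${SNAPSHOT_PATTERN}" "$repo_url" "$target_dir"
}

# Download the skipped objects later; run from inside the clone.
pull_snapshots() {
  git lfs pull --include="${SNAPSHOT_PATTERN}"
}
```

A CI job that never renders snapshots would call only clone_without_snapshots, while a snapshot-testing job would follow it with pull_snapshots. Because the clone stores lfs.fetchexclude in the repo's config, the later pull may also need an explicit --exclude="" to override it; that is an assumption worth checking against the git lfs fetch documentation for the Git LFS version in use.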

All of these details make it clear that Git LFS isn't a regular git repository that handles large files better; there's a sophisticated cloud storage layer behind it that needs the same considerations that more mainstream cloud storage does. Although git does the version control of those files, it doesn't touch the raw objects itself.

[1] Checking in large binary files that change regularly causes repo bloat, and cloud storage solutions like S3 or GCS make for a more complicated setup.

