What is Git made of?<br>Git can be confusing. Git can be scary. The Git CLI may be the least intuitive tool you have to use on a daily basis.<br>But Git is also a wonderfully simple and cleverly designed version control system that definitely deserves its popularity.<br>To prove this point, I invite you to implement your own tiny Git that can create a local repository, commit a single file to it, view commit logs and check out a certain revision of that file.<br>It won’t be more than a couple hundred lines of code, and we’ll try to keep things as simple as we can. Code examples will be in Go, but any other language is suitable for this tutorial, too.<br>git init<br>What turns an empty directory into an empty Git repository? You have probably noticed that Git stores all its internal data in a hidden directory, .git. In fact, only a few special files and folders have to be created there for the Git CLI to treat it as a perfectly valid, empty repository:<br>$ mkdir -p .git/objects/info .git/objects/pack .git/refs/heads .git/refs/tags<br>$ echo "ref: refs/heads/main" > .git/HEAD<br>$ tree .git<br>.git<br>├── HEAD<br>├── objects<br>│   ├── info<br>│   └── pack<br>└── refs<br>    ├── heads<br>    └── tags<br>$ git symbolic-ref --short HEAD<br>main<br>$ git log<br>fatal: your current branch 'main' does not have any commits yet
By using a couple of shell commands we’ve tricked Git into recognising our empty repository with a single main branch and no commits. But what is stored in these directories we’ve created?<br>objects<br>Almost everything in Git is stored as an object: every source file that you commit becomes a blob object, every commit itself is an object, tags are objects, too.<br>For example, suppose we have committed a file.txt with the contents hello\n (6 bytes). This would create 3 objects: a blob (the actual file contents), a tree (a list of file names and permissions), and a commit (a reference to the committed tree with some information about the committer, timestamp etc).<br>For every object Git stores a header with its object type (“blob”, “tree” or “commit”) and its length in bytes, followed by a NUL byte and the content itself. So our hello\n content would actually become the object data blob 6\0hello\n. Additionally, Git uses compression to save disk space, so the object data is compressed with zlib before being written to disk as a file inside .git/objects.<br>hashes<br>Before we dive into the details of writing objects, let’s talk about Git hashes.<br>Every object is uniquely identified inside a Git repo by the SHA hash of its contents. Originally Git only used the SHA-1 hashing algorithm; recent versions also support SHA-256 as an opt-in object format, which is more resistant to hash collisions. However, SHA-1 remains the default and is still used by most Git setups, so we’ll be using it here as well.<br>Let’s get back to our file.txt with the hello\n content. After being compressed, the contents of that blob object could look like this (using a simple Python one-liner for zlib compression):<br>$ python3 -c 'import sys,zlib; sys.stdout.buffer.write(zlib.compress(b"blob 6\0hello\n",6))' | hexdump -C<br>00000000 78 9c 4b ca c9 4f 52 30 63 c8 48 cd c9 c9 e7 02 |x.K..OR0c.H.....|<br>00000010 00 1d c5 04 14 |.....|<br>00000015
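The header-plus-zlib encoding can also be sketched in Go with the standard `compress/zlib` package. The helper names `encodeObject`, `compress` and `decompress` are assumptions made for this illustration. The sketch checks the round trip and does not compare compressed bytes, since those may differ from the Python output.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// encodeObject prepends the "<type> <length>\0" header to the raw content.
func encodeObject(objType string, content []byte) []byte {
	header := fmt.Sprintf("%s %d\x00", objType, len(content))
	return append([]byte(header), content...)
}

// compress deflates data with zlib at the package's default compression level.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and the checksum.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress inflates zlib data back into the raw object bytes.
func decompress(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	raw := encodeObject("blob", []byte("hello\n"))
	packed, err := compress(raw)
	if err != nil {
		panic(err)
	}
	unpacked, err := decompress(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", raw)                 // "blob 6\x00hello\n"
	fmt.Println(bytes.Equal(raw, unpacked)) // true
}
```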
In practice, various zlib implementations may use different compression levels and settings, so the resulting compressed bytes may look different. However, the SHA-1 hash is calculated from the uncompressed raw data of an object and is always the same:<br>$ printf "blob 6\0hello\n" | sha1sum<br>ce013625030ba8dba906f756967f9e9ca394464a -
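The same hash can be computed in Go with `crypto/sha1`. In this minimal sketch the `objectID` helper is a hypothetical name; it hashes the header and the content exactly as the `printf | sha1sum` pipeline does.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// objectID returns the hex-encoded SHA-1 of the uncompressed object data,
// i.e. the "<type> <length>\0" header followed by the content.
func objectID(objType string, content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", objType, len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(objectID("blob", []byte("hello\n")))
	// ce013625030ba8dba906f756967f9e9ca394464a
}
```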
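Now let’s compare that with the Git CLI results in a dummy repo:<br>$ mkdir hello<br>$ cd hello<br>$ git init<br>$ echo "hello" > file.txt<br>$ git add file.txt<br>$ git commit -m 'initial commit'<br>$ git cat-file blob ce013625<br>hello<br>$ hexdump -C .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a<br>00000000 78 01 4b ca c9 4f 52 30 63 c8 48 cd c9 c9 e7 02 |x.K..OR0c.H.....|<br>00000010 00 1d c5 04 14 |.....|<br>00000015<br>Only the second byte differs from our Python output (01 instead of 9c): it is part of the zlib header and records the compression level that was used, while the compressed payload itself is identical.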
Git uses a little optimisation when it stores objects: the first two hex digits of a hash become the subdirectory name and the rest becomes the name of the file where the compressed object data is stored. Let’s reproduce this behaviour by hand in an empty repository, such as the one we created with mkdir earlier:<br>$ mkdir -p .git/objects/ce # first two digits: "ce"<br>$ printf "\x78\x9c\x4b\xca\xc9\x4f\x52\x30\x63\xc8\x48\xcd\xc9\xc9\xe7\x02\x00\x1d\xc5\x04\x14" \<br>> .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a # rest of the hash<br># Now, can Git CLI read our blob?<br>$ git cat-file blob ce0136<br>hello
Writing objects<br>First of all, let’s introduce a Git “class” that will be the main entry point for working with our repo. We will also need a Hash type that handles hash encoding/decoding:<br>type Git struct {<br>Dir string // where `.git` is located<br>Branch string // current branch, e.g. "main"<br>...
type Hash []byte // hashes in Git are presented in hexadecimal form
// NewHash parses a hex-encoded hash (needs the "encoding/hex" and "strings" imports).<br>func NewHash(b []byte) (Hash, error) {<br>dec, err := hex.DecodeString(strings.TrimSpace(string(b)))<br>if err != nil {<br>return nil, err<br>}<br>return Hash(dec), nil<br>}<br>func (h Hash) String() string {...