Git is a distributed version control system (DVCS) that we use every day to manage our code. It is a powerful tool, but have you ever wondered how it works its magic? Git’s internal docs can be intimidating and incomplete, and they lack examples. Digging through Git’s implementation can also be intimidating, particularly if you aren’t familiar with C.

Pulling apart the engine and putting it back together is one of the best ways to understand how a system works. However, instead of writing C, let’s use something more familiar to us as Rails developers. Let’s re-implement Git in Ruby!

If you want to dig deeper into the implementation, check out the RGit source on GitHub.

Git is built in a modular fashion, following the UNIX philosophy of small, sharp tools. Each command is its own script file and the top-level git command simply proxies to them. Git ships with a number of built-in commands, but custom commands can be written as long as they follow a naming convention: an executable named git-<command> on your PATH can be run as git <command>.

    #!/usr/bin/env ruby
    # bin/rgit

    command, *args = ARGV

    if command.nil?
      $stderr.puts "Usage: rgit <command> [<args>]"
      exit 1
    end

    path_to_command = File.expand_path("../rgit-#{command}", __FILE__)

    if !File.exist? path_to_command
      $stderr.puts "No such command"
      exit 1
    end

    exec path_to_command, *args

This script does one of three things when we call it:

Outputs usage information if no subcommand was given

Outputs an error message if no script for the subcommand was found

Runs the given subcommand if it is found

Notice that we pass on any additional arguments to the subcommand.

As good UNIX citizens, we output messages to the standard error stream and return a non-zero exit code when errors occur.
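To make the lookup rule concrete, here is a minimal sketch of it pulled out into a method so it can be tried without calling exec. The method name resolve_subcommand and the bin_dir parameter are assumptions made for illustration; they are not part of the script above.

```ruby
# Hypothetical helper (not in the original script): maps a subcommand name
# to the script that implements it, following the rgit-<command> convention.
# Returns nil when no command was given or no matching file exists.
def resolve_subcommand(command, bin_dir)
  return nil if command.nil? || command.empty?

  candidate = File.join(bin_dir, "rgit-#{command}")
  File.exist?(candidate) ? candidate : nil
end

# The same destructuring the dispatcher uses on ARGV:
command, *args = %w[add bin/rgit]
# command is "add", args is ["bin/rgit"]
```

Under this convention, adding a new subcommand only requires dropping an executable named rgit-<name> next to the others; the dispatcher itself never changes.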

Git stores all of its data and metadata in a .git directory in the root of the repository. The git init command initializes the .git directory and a few subdirectories as follows:

    .git
    ├── HEAD
    ├── config
    ├── objects
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

HEAD is a file that has the hard-coded value ref: refs/heads/master. We’ll need this file later. config contains configuration for the repo. We’ll ignore it for now in the interest of simplicity. The remaining items in the tree are empty directories.

Generating this structure is mostly a lot of calls to Dir.mkdir:

    #!/usr/bin/env ruby
    # bin/rgit-init

    RGIT_DIRECTORY = ".rgit".freeze
    OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
    REFS_DIRECTORY = "#{RGIT_DIRECTORY}/refs".freeze

    # Dir.exist? replaces the deprecated Dir.exists?, which newer Rubies removed
    if Dir.exist? RGIT_DIRECTORY
      $stderr.puts "Existing RGit project"
      exit 1
    end

    def build_objects_directory
      Dir.mkdir OBJECTS_DIRECTORY
      Dir.mkdir "#{OBJECTS_DIRECTORY}/info"
      Dir.mkdir "#{OBJECTS_DIRECTORY}/pack"
    end

    def build_refs_directory
      Dir.mkdir REFS_DIRECTORY
      Dir.mkdir "#{REFS_DIRECTORY}/heads"
      Dir.mkdir "#{REFS_DIRECTORY}/tags"
    end

    def initialize_head
      File.open("#{RGIT_DIRECTORY}/HEAD", "w") do |file|
        file.puts "ref: refs/heads/master"
      end
    end

    Dir.mkdir RGIT_DIRECTORY
    build_objects_directory
    build_refs_directory
    initialize_head

    $stdout.puts "RGit initialized in #{RGIT_DIRECTORY}"

This script is called rgit-init in keeping with the conventions expected by the rgit command we built. If there is already a .rgit directory, we output an error message and exit with a non-zero exit code. Real Git allows you to safely “re-initialize” a repository but let’s opt out of this edge case for our MVP.

The init command is a little verbose but very boring. It creates a bunch of directories as well as the HEAD file.
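If we later wanted to support the “re-initialize” case we skipped, one possible approach is sketched below. It is not how RGit or real Git does it; the method name init_repository, its root parameter, and the returned symbols are all assumptions for illustration. The idea is to create only what is missing and to leave an existing HEAD alone.

```ruby
require "fileutils"

# Hypothetical re-entrant variant of init: safe to run on an existing repo.
def init_repository(root)
  rgit_directory = File.join(root, ".rgit")
  already_existed = Dir.exist?(rgit_directory)

  %w[objects/info objects/pack refs/heads refs/tags].each do |subdirectory|
    # mkdir_p creates missing parents and does nothing for existing directories.
    FileUtils.mkdir_p(File.join(rgit_directory, subdirectory))
  end

  head_path = File.join(rgit_directory, "HEAD")
  # Only write HEAD when it is missing so the current branch is preserved.
  File.write(head_path, "ref: refs/heads/master\n") unless File.exist?(head_path)

  already_existed ? :reinitialized : :initialized
end
```

Calling init_repository(Dir.pwd) twice in a row would return :initialized and then :reinitialized, without touching anything created by the first call.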

Git lets us capture a snapshot of the current state of a file via the git add command. The set of these snapshots is called the staging area. A list of snapshots and their metadata is stored in the index, which lives at .git/index (.rgit/index in our case). Staging a file takes a few steps:

Create a SHA based on the file contents

Create a blob by compressing the file contents

Save that blob as .rgit/objects/<first-two-characters-of-sha>/<rest-of-sha>

Add the SHA and original file path to the index so we can retrieve it later.

The index is a binary file that has the following format:

    DIRC <version_number> <number of entries>

    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
    <ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
    # more entries

A lot of this metadata comes in handy for calculations done by other commands. If you try to open this file, however, you will see a bunch of gibberish.

cat .git/index

bin/rgit-initTREE52 1?Ibin/rgitU?U?2???? ??? C??B=????''9bin2 0 ?Cԣ̏k?i??`V:??3'9Z?6??赠xa?cǢbF

This is because the contents of the index file are stored in a binary format for performance reasons.
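The gibberish is still structured. As a small illustration, the sketch below decodes only the fixed 12-byte header of a real Git index: the DIRC signature, followed by the version and the entry count as 32-bit big-endian integers. The method name parse_index_header is an assumption, and the example feeds it a hand-built header rather than a real file.

```ruby
# Parses the 12-byte header at the start of a Git index file.
# Layout: 4-byte signature "DIRC", 32-bit version, 32-bit entry count,
# with both integers in network (big-endian) byte order.
def parse_index_header(bytes)
  signature, version, entry_count = bytes.unpack("a4NN")
  raise ArgumentError, "not a Git index" unless signature == "DIRC"

  { version: version, entry_count: entry_count }
end

# A hand-built header standing in for the first bytes of a real index:
header = ["DIRC", 2, 3].pack("a4NN")
p parse_index_header(header)
```

Against a real repository, something like parse_index_header(File.binread(".git/index", 12)) should report the index version and how many entries are staged. Decoding the entries themselves is considerably more involved, which is part of why RGit uses a text format instead.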

For simplicity and human-readability, let’s ignore most of the metadata and use a text format. We can return and add these features as they become necessary in the future.

For now, RGit’s index format will look like:

    <SHA> <path>
    <SHA> <path>
    <SHA> <path>
    # more entries
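A text format like this is also trivial to read back. The sketch below is one possible parser, not code from RGit; IndexEntry and parse_index are hypothetical names introduced for illustration.

```ruby
# Hypothetical parser for RGit's simplified index: one "<sha> <path>" per line.
IndexEntry = Struct.new(:sha, :path)

def parse_index(text)
  text.each_line.reject { |line| line.strip.empty? }.map do |line|
    # Split on the first run of whitespace only, so paths containing
    # spaces stay in one piece.
    sha, path = line.chomp.split(" ", 2)
    IndexEntry.new(sha, path)
  end
end

entries = parse_index("b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit\n")
entries.each { |entry| puts "#{entry.path} is staged as #{entry.sha}" }
```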

Let’s look at the actual Ruby code to do all this!

    #!/usr/bin/env ruby
    # bin/rgit-add

    require "digest"
    require "zlib"
    require "fileutils"

    RGIT_DIRECTORY = ".rgit".freeze
    OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
    INDEX_PATH = "#{RGIT_DIRECTORY}/index"

    # Dir.exist? replaces the deprecated Dir.exists?, which newer Rubies removed
    if !Dir.exist? RGIT_DIRECTORY
      $stderr.puts "Not an RGit project"
      exit 1
    end

    path = ARGV.first

    if path.nil?
      $stderr.puts "No path specified"
      exit 1
    end

    file_contents = File.read(path)
    sha = Digest::SHA1.hexdigest file_contents
    blob = Zlib::Deflate.deflate file_contents

    # The first two characters of the SHA name the directory, the rest the file
    object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
    FileUtils.mkdir_p object_directory
    blob_path = "#{object_directory}/#{sha[2..-1]}"

    # Compressed data is binary, so write it in binary mode
    File.open(blob_path, "wb") do |file|
      file.print blob
    end

    File.open(INDEX_PATH, "a") do |file|
      file.puts "#{sha} #{path}"
    end

Let’s start versioning RGit with RGit! First we need to add a file to the staging area:

rgit add bin/rgit

Our .rgit directory now looks like:

    .rgit
    ├── HEAD
    ├── index
    ├── objects
    │   ├── b3
    │   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

Notice that we now have a file in the objects directory. It contains the compressed source of bin/rgit.

Finally, our index looks like:

cat .rgit/index

b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit
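Since a blob is just the zlib-compressed file contents stored under its SHA, reading one back is the add step in reverse. The sketch below assumes RGit’s simplified scheme; read_blob is a hypothetical helper, not part of the scripts in this post. Real Git prepends a header such as "blob <size>\0" before hashing and compressing, so the integrity check here applies to RGit objects only.

```ruby
require "digest"
require "zlib"

# Hypothetical helper: given the objects directory and a SHA, returns the
# original file contents that the add step stored.
def read_blob(objects_directory, sha)
  blob_path = File.join(objects_directory, sha[0..1], sha[2..-1])
  contents = Zlib::Inflate.inflate(File.binread(blob_path))

  # RGit names a blob after the SHA-1 of its uncompressed contents, so
  # re-hashing the result is a cheap integrity check.
  unless Digest::SHA1.hexdigest(contents) == sha
    raise "corrupt object #{sha}"
  end

  contents
end
```

For the example above, read_blob(".rgit/objects", "b302dd6f8cd2b385b170e78c14503342c0ba6ae8") should return the source of bin/rgit as it was when it was staged.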

Blobs are the contents of a particular file at a particular time. In order to capture a snapshot of the entire project, Git bundles a bunch of these into a commit.

In order to capture the directory structure of the project, Git creates a “tree” object for each directory of a project. Each tree object contains a list of the tracked files and their associated blobs, as well as tree objects for subdirectories.

This gives us a tree structure that mirrors the tracked project’s filesystem. Directories are represented by “tree” objects while files are “blobs”. This whole tree structure is then tied to a “commit” object so that we can refer to it later.

The commit command does three things:

Build the tree/blob structure

Create a commit object that points to that structure

Update the current branch to point to this commit

Because creating objects is a common task, I’ve extracted it to RGit::Object.

    # lib/rgit/object.rb

    require "fileutils"

    module RGit
      RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
      OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze

      class Object
        def initialize(sha)
          @sha = sha
        end

        # Creates the object's directory, then yields the open file to the block
        def write(&block)
          object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
          FileUtils.mkdir_p object_directory
          object_path = "#{object_directory}/#{sha[2..-1]}"
          File.open(object_path, "w", &block)
        end

        private

        attr_reader :sha
      end
    end

This class handles all of the directory/path related tasks as well as opening the file. It then yields to the given block for the actual writing of the object’s contents.

With this refactor done, let’s take a look at the commit command:

    #!/usr/bin/env ruby
    # bin/rgit-commit

    $LOAD_PATH << File.expand_path("../../lib", __FILE__)

    require "digest"
    require "time"
    require "rgit/object"

    RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
    INDEX_PATH = "#{RGIT_DIRECTORY}/index"
    COMMIT_MESSAGE_TEMPLATE = <<-TXT
    # Title
    #
    # Body
    TXT

    # Without a block, File.foreach returns an Enumerator over the index lines
    def index_files
      File.foreach(INDEX_PATH)
    end

    # First pass: convert the flat index into a nested hash mirroring the file tree
    def index_tree
      index_files.each_with_object({}) do |line, obj|
        # Each index line is "<sha> <path>", as written by rgit-add
        sha, path = line.chomp.split(" ", 2)
        segments = path.split("/")
        segments.reduce(obj) do |memo, s|
          if s == segments.last
            memo[segments.last] = sha
            memo
          else
            memo[s] ||= {}
            memo[s]
          end
        end
      end
    end

    # Second pass: write one tree object per directory, recursing into subdirectories
    def build_tree(name, tree)
      sha = Digest::SHA1.hexdigest(Time.now.iso8601 + name)
      object = RGit::Object.new(sha)

      object.write do |file|
        tree.each do |key, value|
          if value.is_a? Hash
            dir_sha = build_tree(key, value)
            file.puts "tree #{dir_sha} #{key}"
          else
            file.puts "blob #{value} #{key}"
          end
        end
      end

      sha
    end

    def build_commit(tree:)
      commit_message_path = "#{RGIT_DIRECTORY}/COMMIT_EDITMSG"

      `echo "#{COMMIT_MESSAGE_TEMPLATE}" > #{commit_message_path}`
      `$VISUAL #{commit_message_path} >/dev/tty`

      message = File.read commit_message_path
      committer = "user"
      sha = Digest::SHA1.hexdigest(Time.now.iso8601 + committer)
      object = RGit::Object.new(sha)

      object.write do |file|
        file.puts "tree #{tree}"
        file.puts "author #{committer}"
        file.puts
        file.puts message
      end

      sha
    end

    def update_ref(commit_sha:)
      # HEAD contains "ref: refs/heads/<branch>"; keep only the ref path
      current_branch = File.read("#{RGIT_DIRECTORY}/HEAD").strip.split.last

      File.open("#{RGIT_DIRECTORY}/#{current_branch}", "w") do |file|
        file.print commit_sha
      end
    end

    def clear_index
      File.truncate INDEX_PATH, 0
    end

    # The index does not exist until the first rgit add, so check for that too
    if !File.exist?(INDEX_PATH) || index_files.count == 0
      $stderr.puts "Nothing to commit"
      exit 1
    end

    root_sha = build_tree("root", index_tree)
    commit_sha = build_commit(tree: root_sha)
    update_ref(commit_sha: commit_sha)
    clear_index

This file does several things:

Exits with error code and message if there are no files to commit

Creates all the necessary tree objects for the files in the index

Creates a commit object pointing to the root tree object

Updates the current branch to point to the commit

Clears the index

Building the tree is done in two passes. First the index is converted into a hash structure representing the file tree. Secondly, this structure is converted to tree objects on the filesystem. Both steps are done recursively.
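To see what the first pass produces, here is a standalone sketch of the same idea. It is written slightly differently from index_tree, and the method name nest_paths, the shortened SHAs, and the README.md entry are made up for illustration.

```ruby
# Turns [sha, path] pairs into a nested hash. Directories become hashes
# and files map to their blob SHA.
def nest_paths(entries)
  entries.each_with_object({}) do |(sha, path), tree|
    *directories, filename = path.split("/")
    # Walk down one hash per directory, creating it if needed,
    # then record the file in the innermost hash.
    leaf = directories.reduce(tree) { |node, dir| node[dir] ||= {} }
    leaf[filename] = sha
  end
end

tree = nest_paths([
  ["b302dd6f", "bin/rgit"],
  ["8cfe5665", "bin/rgit-add"],
  ["aaaa1111", "README.md"],
])
# tree == {
#   "bin" => { "rgit" => "b302dd6f", "rgit-add" => "8cfe5665" },
#   "README.md" => "aaaa1111"
# }
```

The second pass then walks this hash: every nested hash becomes a tree object and every string value becomes a blob line inside its parent tree.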

For the commit message, we simply open a file using the user’s $VISUAL editor. Once the user exits their editor, we read the file and put the contents into the commit.

Let’s see it all come together. Staging and committing bin/rgit and bin/rgit-add gives us the following results in .rgit:

    .rgit
    ├── COMMIT_EDITMSG
    ├── HEAD
    ├── index
    ├── objects
    │   ├── 63
    │   │   └── 45493c987e6144cc68142ad2405db681b28628
    │   ├── 8c
    │   │   └── fe566596683acae588039156f40ecaff282c30
    │   ├── ae
    │   │   └── 161568392ed9aa321466446a9bb01acb111e4f
    │   ├── b3
    │   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
    │   ├── f9
    │   │   └── 60e7d48c47e86289a653b0afc0b7a13a9d372e
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        │   └── master
        └── tags

In order to find the current state, we first look up what branch we are on by checking .rgit/HEAD. This points to .rgit/refs/heads/master, the master branch. The master branch points to its latest commit. The commit in turn points to a tree object representing the root of the project. This tree object points to another tree object representing the bin/ directory, which in turn points to two blob objects containing the compressed contents of bin/rgit and bin/rgit-add at the time of the commit.
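That chain of lookups can be followed with a few file reads. The sketch below relies on the plain-text commit format written by bin/rgit-commit, where the first line is "tree <sha>". The helper names current_commit_sha, read_object, and root_tree_sha are assumptions for illustration and do not exist in RGit.

```ruby
# Follows HEAD -> branch file -> commit SHA.
def current_commit_sha(rgit_directory)
  # HEAD contains "ref: refs/heads/<branch>"; keep only the ref path.
  ref = File.read(File.join(rgit_directory, "HEAD")).strip.split.last
  File.read(File.join(rgit_directory, ref)).strip
end

# Reads an RGit tree or commit object, which is stored as plain text.
def read_object(rgit_directory, sha)
  File.read(File.join(rgit_directory, "objects", sha[0..1], sha[2..-1]))
end

# Follows commit -> root tree SHA.
def root_tree_sha(rgit_directory)
  commit = read_object(rgit_directory, current_commit_sha(rgit_directory))
  commit.lines.first.split.last
end
```

From the root tree, each "tree" line leads to a subdirectory and each "blob" line leads to a compressed file, so repeating read_object on those SHAs walks the whole snapshot.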

This structure of objects pointing to each other is what makes Git so powerful. By simply changing a few of these pointer files, we can switch to different points in history.
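As a rough illustration of how little is involved, the sketch below creates a branch and switches to it purely by writing pointer files. RGit does not implement these commands; create_branch and switch_branch are hypothetical names, and unlike a real checkout this does not touch the files in the working directory.

```ruby
require "fileutils"

# A branch is a file under refs/heads containing a commit SHA.
def create_branch(rgit_directory, name, commit_sha)
  heads_directory = File.join(rgit_directory, "refs", "heads")
  FileUtils.mkdir_p(heads_directory)
  File.write(File.join(heads_directory, name), commit_sha)
end

# Switching branches means rewriting HEAD to point at another ref.
def switch_branch(rgit_directory, name)
  ref = "refs/heads/#{name}"
  unless File.exist?(File.join(rgit_directory, ref))
    raise ArgumentError, "no such branch: #{name}"
  end

  File.write(File.join(rgit_directory, "HEAD"), "ref: #{ref}\n")
end
```

After create_branch(".rgit", "topic", some_commit_sha) and switch_branch(".rgit", "topic"), the next rgit commit would update refs/heads/topic instead of master, because update_ref follows whatever HEAD names.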