Microsoft’s Performance Contributions to Git in 2017

Derrick

January 11th, 2018

Visual Studio Team Services (VSTS) hosts the largest Git repository in the world: the Windows source code. Keeping a primary copy of the code available in the cloud and having it be performant while being updated by over 4000 users at the same time is a monumental achievement, but it is only useful if engineers can use the core Git client on their machines. We made this possible by building GVFS.

The Windows repository is larger than any other Git repository by orders of magnitude, and that exposed a few performance issues in core Git that we needed to fix to make it work with the large repositories we see at Microsoft. Thanks to Git being open source, we are improving Git for all users, on all platforms by contributing these modifications back.

Looking back at what we built in 2017 and how far we’ve come, I wanted to share details of some of my favorite patches that we’ve worked on with the Git community over the past year.

The Index

The Git index is a list of all files in the current working directory and the expected object hash of each, based on the current staging area. Many Git operations load this index into memory before performing the requested action. We found several ways to speed up index interactions.

The index is an ordered list of paths. On each index load, Git checked to be sure the list was still ordered. By skipping this check we can speed up the index load by 18%. When the index is rebuilt, the paths are written in the correct order. Git checks for duplicates on insertions, but any duplicates appear consecutively. By checking the last entry before performing binary search, we sped up index writing by up to 20%. We also reduced how often Git would discard and reload the index.
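The last-entry check described above can be sketched as follows. This is a minimal illustration in Python, not Git's C implementation; the function name `add_entry` and the use of a plain list of path strings are assumptions made for the example.

```python
from bisect import bisect_left


def add_entry(paths, path):
    """Insert `path` into the sorted list `paths`.

    Returns True if the path was added and False if it was a duplicate.
    (Hypothetical helper for illustration only.)
    """
    # Fast path: when entries arrive in sorted order, the new path belongs
    # at the end, so comparing against the last entry is enough.
    if not paths or path > paths[-1]:
        paths.append(path)
        return True
    if path == paths[-1]:
        # Duplicates arrive consecutively, so this catches them cheaply.
        return False

    # Slow path: the entry is out of order, so fall back to binary search.
    pos = bisect_left(paths, path)
    if pos < len(paths) and paths[pos] == path:
        return False
    paths.insert(pos, path)
    return True


index = []
for p in ["a.txt", "b/c.txt", "b/c.txt", "d.txt", "b/a.txt"]:
    add_entry(index, p)
```

When the input is already ordered, every insertion takes the fast path and no binary search runs at all, which is where the saving comes from.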

We also contributed micro-optimizations that sped up every index read or write, including using a hashmap instead of a list when computing merges and using the stack instead of heap-allocations.

Status and Checkout

Two frequently-used Git commands are status and checkout. status examines the state of the working directory to see what is different from the current HEAD, while checkout updates the working directory to match a new HEAD. These operations are called frequently but are also very expensive when working on large repositories.

Many tools, such as Visual Studio Team Explorer, use status to present the list of changes available to commit. Many projects have large directories filled with build artifacts that are ignored by status due to .gitignore files. Team Explorer uses special flags to status to show these ignored files, but that can be a much larger list than the important files. We added new flags to status to make this call faster and now other tools can use these options, too. While we were looking at that code, we found ways to improve performance of git status --ignored by up to 50%.

Even with these speedups, we still need to walk the filesystem to find the current state of the written files. At least, we did need to. We added a file-system monitor plug-in to Git: Git runs an external command that reports the file system changes, so it no longer has to scan the whole working directory. While we are focused on providing integration with GVFS, this can work with tools like Watchman, too.
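The idea behind the file-system monitor can be sketched roughly as follows. This is a conceptual illustration in Python, not the interface Git uses; the function `paths_to_check` and its arguments are hypothetical names chosen for the example.

```python
def paths_to_check(tracked_paths, changed_paths=None):
    """Return the tracked paths that a status-like scan must examine.

    `changed_paths` stands in for the answer from an external monitor
    (hypothetical here). None means no monitor is available.
    """
    if changed_paths is None:
        # Without a monitor, every tracked file has to be examined.
        return sorted(tracked_paths)
    # With a monitor, only the files it reported need a closer look.
    return sorted(set(tracked_paths) & set(changed_paths))


tracked = ["src/main.c", "src/util.c", "README.md"]
full_scan = paths_to_check(tracked)
quick_scan = paths_to_check(tracked, ["src/util.c", "build/out.o"])
```

In a repository with millions of files and a handful of edits, the second call does work proportional to the number of edits rather than the size of the repository.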

Abbreviations

Many Git commands present object hashes in abbreviated form for easier reading by a human. These abbreviations need to be long enough to uniquely identify a single object in the repository. For large repos, calculating abbreviations became a significant portion of the cost of common commands. The old algorithm tested a guessed abbreviation length by iterating through all objects that started with the abbreviated hash, then increased the length by one until only one object remained. The new algorithm finds the correct length by performing efficient binary searches for the closest matches and computing the common prefix lengths. This change speeds up commands like git log --oneline by 5% on the Linux repository.
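The binary-search approach can be sketched like this. It is a simplified illustration in Python that assumes all object hashes sit in one sorted list of unique hex strings; the names `min_unique_abbrev` and `common_prefix_len` and the `min_len=4` default are assumptions for the example, not Git's actual code.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Count how many leading characters two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def min_unique_abbrev(sorted_hashes, full_hash, min_len=4):
    """Shortest prefix length of `full_hash` that matches no other hash.

    `sorted_hashes` must be sorted and free of duplicates (assumption).
    """
    pos = bisect_left(sorted_hashes, full_hash)
    needed = min_len

    # The closest matches are the immediate neighbours in sorted order,
    # so only those two comparisons are required.
    if pos > 0:
        shared = common_prefix_len(sorted_hashes[pos - 1], full_hash)
        needed = max(needed, shared + 1)

    # Skip over the hash itself if it is present in the list.
    present = pos < len(sorted_hashes) and sorted_hashes[pos] == full_hash
    nxt = pos + 1 if present else pos
    if nxt < len(sorted_hashes):
        shared = common_prefix_len(sorted_hashes[nxt], full_hash)
        needed = max(needed, shared + 1)

    return min(needed, len(full_hash))


hashes = sorted(["abcd1234", "abcd1299", "abff0000", "12345678"])
```

One binary search plus two prefix comparisons replaces the repeated scans of the guess-and-extend approach.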

Git stores its objects in two ways: packfiles and loose objects. Loose objects are stored as one-object-per-file in the .git/objects/?? directories where ?? stands for the first two hex digits of the object hash. When GVFS downloads an object on-demand, it places it as a loose object. Most repositories do not have many loose objects before Git automatically repacks into packfiles, but in the GVFS case the repos can contain millions of loose objects. When computing abbreviations, Git creates an in-memory cache that lists all loose objects in each of these directories. When creating that list, the list of strings was created by calling a generic string-format method to append strings. This took up to 12% of CPU time in some cases. It was easy to replace this with a simple append method for an easy performance win.
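The loose-object layout described above can be illustrated with a short sketch. It is written in Python for readability and assumes SHA-1 object hashes; the helper names `blob_hash` and `loose_object_path` are made up for the example.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Compute the object hash of a blob.

    Git hashes the header "blob <size>\\0" followed by the content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(git_dir: str, object_hash: str) -> str:
    """Build the loose-object path with plain string concatenation.

    The first two hex digits name the directory and the remaining
    digits name the file inside it.
    """
    return git_dir + "/objects/" + object_hash[:2] + "/" + object_hash[2:]


empty_blob = blob_hash(b"")
empty_blob_path = loose_object_path(".git", empty_blob)
```

Because the first two hex digits select the directory, loose objects are spread across at most 256 directories, and listing one directory yields every loose object that shares that two-digit prefix.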

Just Getting Started

These were not the only improvements we made in 2017; our earlier posts describe more of them.

Microsoft made a big bet on Git, making it the primary version control system for Microsoft projects, hosted by VSTS. We’ve increased our investment in making Git better for everyone and will continue with some big improvements in the coming months and years. Thanks to everyone in the Git community who has been incredibly willing to work with us and help us get these patches reviewed and contributed back upstream.