About a year ago, we wanted to publish a central Helm chart repository. It seemed like the easiest way to do that was from a single source, so we migrated all of our Helm charts to a central Git repository. The idea was to use CircleCI to build every chart, then upload the resultant charts to S3 and serve them from there. It wasn’t a perfect solution, though, since it made coordination with application releases and tracking issues more difficult.

Fast forward one year, and we’ve published our free Helm Chart repository service. We build our products based on customer needs, which often overlap with our own, so we naturally came to the solution of migrating the static S3 repository to this service. We ended up with a simpler and faster build flow, and were finally able to get rid of the central Git repository, and to store application charts next to the applications themselves.

Unfortunately, migrating existing charts wasn’t as easy as we’d first imagined. If we were migrating simple charts, we just moved the files to the application repo: job done. If we were migrating popular, open source charts, however, that wasn’t an option.

We wanted to preserve the history of the charts, so that we could make sure our contributors got the credit they were due. We considered using subtree merges, but that would have shown up as a single commit in the file history. That wasn’t what we wanted; our goal was to merge the relevant history from one repo into another. Furthermore, we wanted to move the charts from the root of their repo into a subdirectory (charts/) within the application repo, but again, with their file history preserved. Quite a challenge!

In order to understand how we found a solution, we have to first talk about Git commits and Git history.

Git commits and history

Most people interact with a Git repository through commits. More precisely, we usually commit changes which we call diffs when comparing them to previous states. When we think about the history of a repository, then, we see a tree of commits, each storing a diff.

When we want to alter that history, we usually rebase a branch - or a set of commits - onto another branch. This means that we reapply the diffs onto another branch of the tree. Whenever there is a point where a diff cannot be applied automatically, Git marks it as a conflict, which must be resolved manually.

This should be sounding pretty familiar to anyone who has ever used Git.

Here’s the interesting part. What if I told you that commits aren’t actually diffs, but snapshots of every file at given points in their history? Although the second part isn’t technically true (think about it: repositories would be much, much bigger that way), commits aren’t technically diffs either.

If I had to compare them to something, I would say they’re very similar to copy-on-write: changed files are copied, while unchanged files are shared with a previous version. There are other optimizations and helpful techniques here, which are irrelevant to our needs; the important thing to remember is that you can treat commits as both diffs and snapshots.
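The sharing described above can be observed directly. The following is a minimal sketch in a throwaway repository; the file names and identity settings are made up for the demo and have nothing to do with the chart repositories.

```shell
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# First commit: two files.
echo "stable content" > unchanged.txt
echo "version one" > changed.txt
git add .
git commit -q -m "first"

# Second commit: only one of the files changes.
echo "version two, edited" > changed.txt
git commit -q -am "second"

# A commit object points at a whole tree (a snapshot), not at a patch.
git cat-file -p HEAD

# The untouched file resolves to the same blob in both snapshots...
unchanged_before=$(git rev-parse HEAD~1:unchanged.txt)
unchanged_after=$(git rev-parse HEAD:unchanged.txt)
# ...while the edited file gets a new blob.
changed_before=$(git rev-parse HEAD~1:changed.txt)
changed_after=$(git rev-parse HEAD:changed.txt)

echo "unchanged.txt: $unchanged_before -> $unchanged_after"
echo "changed.txt:   $changed_before -> $changed_after"
```

Both snapshots list every file, but the unchanged one is stored only once, which is why the repository does not grow by a full copy with every commit.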

With rebase we can work with commits as diffs. But that isn’t really what we want. We need a way to filter commits related exclusively to a set of files, then move all those files to a subdirectory. Does that have anything to do with diffs? We could probably write a script that uses rebase to select diffs related to our charts (dropping every other commit, rewriting related commits and removing unrelated code from them), but that would take a few years off our lives.

What we want is to remove unnecessary files from every commit, and relocate every chart in every commit to a new subdirectory. But how does the “snapshot model” for commits provide a solution? If every commit is a snapshot, it means that, whatever changes I make to a file with a commit, the next commit will simply overwrite them. This ensures that modifying a commit as a snapshot will never create a conflict, because the next commit has all the content.

But if I change the file with every commit, in the end, I’ll see my changes at the top of my branch. Not only that, but applying the same changes to every commit ensures that the “diff model” stays intact; if I move a file to a subdirectory with each new commit, the diff can properly be calculated between commits.

Fortunately, Git gives us a tool to work with commits as “snapshots”. It’s called git filter-branch.

I’m not going to go into detail about everything git filter-branch can do, because that would be enough to fill another blog post (or two). For now, let’s think of it as a git rebase that works with snapshots instead of diffs. A major difference between filter-branch and rebase is that, while you would normally rebase commit by commit, when working with snapshots you’ll want to apply certain changes to all or some of them based on differing criteria. See the explanation above.

Now, we want to do three things:

remove unrelated charts from the repo (to create a clean working copy)

move every chart into a charts/ directory

merge the whole tree into the application repository

I’ll show you step-by-step how I did this for three charts from our Bank-Vaults project. We are going to move the following charts to the application repository:

vault

vault-operator

vault-secrets-webhook

The first step is to check out a new working copy of the central Git repository:

    git clone git@github.com:banzaicloud/banzai-charts.git
    cd banzai-charts
    git checkout 38f537804f953c986ad1796bddd4848a7559af98
    git checkout -b migrate

The next step is to remove everything else, so that only the three charts remain in the repository:

    git filter-branch --prune-empty --index-filter \
        'git ls-files \
            | grep -v "^vault/\|^vault-operator/\|^vault-secrets-webhook/" \
            | xargs --no-run-if-empty git rm --cached' \
        HEAD

Note: On macOS you might want to install the GNU version of xargs by running brew install findutils, which installs it as gxargs.

Okay, let’s take a look at what happened here. We used a so-called index-filter, which runs a command against the index of every commit without checking the files out into a working tree, so it is much faster than filters that do. The command passed to the filter lists all files in the commit, filters out the three charts that we want to keep, and removes everything else from the index.

Removing files from every commit will result in a repository that only contains the desired files… and tons of empty commits. Since many snapshots will be completely identical to previous ones (because we deleted the content they were supposed to change), Git will show them as empty commits. To make sure we only keep the relevant commits (ones that change one of the three charts), we used the --prune-empty option, which does exactly what its name suggests: it deletes commits that become empty as a result of applying the filter. The HEAD at the end of the command tells filter-branch which commits to rewrite: everything reachable from the current HEAD, that is, our migrate branch.

After running the above command, you should see only the three desired directories, and about 120 commits (instead of ~1000).
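To see the effect of the index-filter and --prune-empty on a small scale, here is a sketch that runs the same kind of command in a throwaway repository. The chart names (keep-chart, other-chart) are hypothetical, the demo uses -r, the short form of --no-run-if-empty in GNU xargs, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the notice that newer Git versions print before running filter-branch.

```shell
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the notice (and its delay) in newer Git

# Throwaway repository; chart names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# Three commits: two touch the chart we keep, one touches an unrelated chart.
mkdir keep-chart other-chart
echo "name: keep" > keep-chart/Chart.yaml
git add . && git commit -q -m "add keep-chart"
echo "name: other" > other-chart/Chart.yaml
git add . && git commit -q -m "add other-chart"
echo "version: 0.2.0" >> keep-chart/Chart.yaml
git add . && git commit -q -m "bump keep-chart"

before=$(git rev-list --count HEAD)

# Same shape as the real command, with a single directory on the keep list.
git filter-branch --prune-empty --index-filter \
    'git ls-files \
        | grep -v "^keep-chart/" \
        | xargs -r git rm -q --cached' \
    HEAD

after=$(git rev-list --count HEAD)
remaining=$(git ls-tree -r --name-only HEAD)

echo "commits: $before -> $after"
echo "files left: $remaining"
```

The commit that only touched other-chart becomes identical to its parent once the filter has run, so --prune-empty drops it; the two commits that touched keep-chart survive.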

So far so good. Now, let’s move the charts to a charts/ subdirectory. We’ll use filter-branch for this as well:

    git filter-branch -f --tree-filter \
        'mkdir -p /tmp/chart-migration; \
        mv * /tmp/chart-migration; \
        mkdir charts; \
        mv /tmp/chart-migration/* charts/' \
        HEAD

Here, we’re using tree-filter, because we need to move files around. Remember, the index-filter operates on the index and we used Git commands to delete files, but we can’t do that here (or at least that would be a bit tricky). The tree-filter checks out every commit into a temporary directory before running the command, so it would usually be much slower than the one we just discussed, but we’ve already gotten rid of a lot of commits, so it should finish relatively quickly.

Notice the -f option at the beginning of the command. git filter-branch saves the original refs under refs/original/ as a backup, and refuses to run again while that backup exists, unless you explicitly tell it to overwrite it.
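The backup behaviour is easy to reproduce. The sketch below uses a throwaway repository with a hypothetical branch name (demo) and made-up files; it runs one rewrite, lists the backup ref, and then shows that a second rewrite without -f is refused.

```shell
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the notice (and its delay) in newer Git

# Throwaway repository; branch and file names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/demo   # fixed branch name, whatever the local default is
git config user.name "Demo"
git config user.email "demo@example.com"

echo one > a.txt
git add . && git commit -q -m "one"
echo two > b.txt
git add . && git commit -q -m "two"

# First rewrite: drop b.txt from every commit.
git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch b.txt' HEAD

# The pre-rewrite branch tip is kept under refs/original/.
backup=$(git for-each-ref --format='%(refname)' refs/original/)
echo "backup ref: $backup"

# A second run without -f stops before rewriting anything.
if git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch a.txt' HEAD 2>/dev/null; then
    second_run=ran
else
    second_run=refused
fi
echo "second run without -f: $second_run"
```

Passing -f, as in the command above, overwrites the backup, so the refs saved by the first rewrite are replaced by the state before the second one.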

Once the command finishes, you should see a single charts/ directory in the repository.
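One way to check that the move was applied to every snapshot, and not just to the latest one, is to list the top-level entries of each rewritten commit. The following sketch does this in a throwaway repository with a single made-up chart; it uses a mktemp scratch directory in place of /tmp/chart-migration, which is an assumption of the demo rather than part of the original command.

```shell
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip the notice (and its delay) in newer Git

repo=$(mktemp -d)
stage=$(mktemp -d)   # scratch directory, plays the role of /tmp/chart-migration
export stage         # the filter is evaluated by filter-branch and must see it

cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# Two commits touching a single, hypothetical chart at the repository root.
mkdir vault
echo "name: vault" > vault/Chart.yaml
git add . && git commit -q -m "add vault chart"
echo "version: 0.2.0" >> vault/Chart.yaml
git commit -q -am "bump vault chart"

# Move everything into charts/ in every commit.
git filter-branch --tree-filter '
    mv * "$stage"/
    mkdir charts
    mv "$stage"/* charts/
' HEAD

# Top-level entries across all rewritten commits, de-duplicated.
top_level=$(for c in $(git rev-list HEAD); do
    git ls-tree --name-only "$c"
done | sort -u)

# Number of commits that touch the chart at its new location.
chart_commits=$(git rev-list --count HEAD -- charts/vault)

echo "top-level entries: $top_level"
echo "commits touching charts/vault: $chart_commits"
```

If the filter had only affected the tip of the branch, older commits would still list vault at the top level; here every commit lists charts only, and both commits still count as changes to the chart.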

One last step remains, merging the slimmed down chart repository into the application repository:

    cd ..
    git clone git@github.com:banzaicloud/bank-vaults.git
    cd bank-vaults
    git checkout f618ec522f674263da1e7a2de74661fa737f2cdb
    git checkout -b migrate
    git remote add charts ../banzai-charts
    git fetch charts migrate
    git merge --allow-unrelated-histories charts/migrate

That’s it, you’ve successfully merged the relevant charts with history into the application repository.
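To convince yourself that the history really came along, you can ask Git for the log of a chart path after the merge. The sketch below replays the merge step with two throwaway repositories standing in for the chart and application repositories; the repository names, file contents and author names are all made up for the demo.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-in for the rewritten chart repository (branch "migrate").
git init -q charts-repo
cd charts-repo
git symbolic-ref HEAD refs/heads/migrate
git config user.name "Chart Author"
git config user.email "chart@example.com"
mkdir -p charts/vault
echo "name: vault" > charts/vault/Chart.yaml
git add . && git commit -q -m "add vault chart"
echo "version: 0.2.0" >> charts/vault/Chart.yaml
git commit -q -am "bump vault chart"
cd ..

# Stand-in for the application repository.
git init -q app-repo
cd app-repo
git config user.name "App Author"
git config user.email "app@example.com"
echo "package main" > main.go
git add . && git commit -q -m "add application"

# Same steps as above: add the chart repo as a remote, fetch, merge.
git remote add charts ../charts-repo
git fetch -q charts migrate
git merge -q --no-edit --allow-unrelated-histories charts/migrate

# The chart commits, with their original authors, are now part of this history.
git log --format='%an: %s' -- charts/vault
chart_commits=$(git rev-list --count HEAD -- charts/vault)
chart_authors=$(git log --format='%an' -- charts/vault | sort -u)
```

The log for charts/vault lists the commits from the chart repository under their original author, rather than a single commit by whoever performed the migration, which was the point of the whole exercise.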

Further reading

After I wrote the majority of this post, I found another, which goes into further detail about commits and git filter-branch:

https://manishearth.github.io/blog/2017/03/05/understanding-git-filter-branch/

It will also help with your everyday git-fu, and help you understand some of the most important data structures behind Git. Imagine how nice it would have been if I had found it before starting this whole migration. :)

Similarly, Git for computer scientists is a must-read.
