Git Source Code Review: Diff Algorithms

To find the differences between two "things" is a difficult task that Git has to perform very often:

Find the differences between two binaries files in order to generate a delta.

Find the differences between two text files in order to merge properly.

Find the differences between two text files in order to display it.

Find the differences between two commands in order to find typos.

Here is a summary of all the algorithms used with an associated genesis.



Part 1: Overview

Part 2: Genesis Part 3: Architecture Part 4: Algorithms for DIFF





Diff on text

Find the difference between text files is something Git has to do the most commonly :

Look at the difference between index and working directory: git diff file.txt .

. Compare the change between two commits: git diff commitBefore commitAfter

Generate a patch : git format-patch master --stdout > mypatch.patch .

The first implementation simply leveraged the system diff executable via a call to popen in show-diff.c . diff is a very famous and ubiquitous tool in the Linux world originally developed in the early 1970 for Unix. Its first version used Hunt–McIlroy algorithm. The core algorithm was later notoriously improved thanks to the work of Eugene W. Myers and Webb Miller, work extensively document in the papers: An O(ND) Difference Algorithm and its Variations by Eugene W. Myers and A File Comparison Program by Webb Miller and Myers.



The two most important metrics for diff operations are:

Speed Quality

Effort on Speed :



System diff not only consumed a fork / exec , it also required a lot of useless access disks because of temporary data. A builtin diff module was needed and implemented in March 2006 patch which introduced the xdiff folder containing a simplified version of libxdiff:

Since libxdiff used Eugene W. Myers's algorithm, the diff strategy was called myers . To this day it is still the default way git perform all diffs on text.



Effort on Quality :



If speed is an easy metric to understand and benchmark, quality is subjective and not intuitive. Take for example two versions of a file where a developer simply added a method functhreehalves in the middle. There are two way to reconstruct the change but one is of much higher quality because it clearly shows what was added where the other seems to indicate something was modified.





$diff -y file2 file1 void func1() { void func1() { x += 1 x += 1 } } > void functhreehalves() { > x += 1.5 > } > void func2() { void func2() { x += 2 x += 2 } } $diff -y file2 file1 void func1() { void func1() { x += 1 x += 1 } > } > > void functhreehalves() { > x += 1.5 } void func2() { void func2() { x += 2 x += 2 } }

The discussion started in [PATCH 0/3] Teach Git about the patience diff algorithm where quality between git's myers and bazar's patience were compared. Since patience algorithm seemed to provide a better quality on specific input it was shortly added in a patch that modified further xdiff :

You can find more documentation on patience algorithm from the author himself Bram Cohen (yes, the guy who also invented Bittorrent): Patience Diff Advantages and Patience Diff, a brief summary.



Further work was done by people from jgit project (which is a Java/Eclipse based). Their histogram based technique was supposed to be faster than myers while still based on based on Cohen's four rules found in patience (best of both world). A patch was eventually submitted:

Note :

myers

histogram

patience

Note :

minimal

Minimal

Trivia :

Trivia :

GIT_EXTERNAL_DIFF

main (git.c) cmd_diff (builtin/diff.c) builtin_diff_files (builtin/pack-objects.c) run_diff_files (diff-lib.c) xdi_diff_outf (xdiff-interface.c) xdi_diff (xdiff-interface.c) xdl_diff (xdiff/xdiffi.c) xdl_do_diff (xdiff/xdiffi.c) xdl_do_patience_diff xdl_do_histogram_diff meyes meyes mininmal

Diff on binaries (deltas)

So which diff strategy should be used? I personally have always used the default ().andseem to provide better output on specific case but they are so marginal and poorly documented that it is impossible to know when they will be good withiout trial/errors. If non-obsolete test cases could be provided users could probably make educated decisions.There is a fourth diff strategy calledwhich is actually an extended version of myers: If no minimal diff is requested the myers algo simply returns . But if minimal is requested, the algorithm will iterate until a certain amount of CPU time has been consumed and keep the "best" diff. Best being the smallest diff.was always part of xlibdiff and came with the first "speed" patch mentioned earlier.This need for speed also triggered the creation of a performance testing framework for diff strategies.diff is such an important feature that git developer even allow user to use their own external custom engine via thevariable ( Using Git diff to detect code movement ).To finish on text diff, and example of trace trace execution that I found useful:

The initial architecture of Git was: one text file => one sha1 => one deflated blob. The only drawback was HD storage consumption but "disk is cheap" was a fair approach and things worked well this way for a while. But big repositories suffered when being cloned across a network and all those blobs required a lot of disk access.



The topic RFC: adding xdelta compression to git raised those issues and in the end it was decided that storage was cheap but bandwidth and I/O were not. The xdelta lib was briefly considered but a customized version of xlibdiff was favored and added to xdiff.

Since binary files miss the CR character used to build atomic units in text algorithm (lines), different heuristic can be used in order to cut the stream. Set length of 48 bytes is an approach but git used adler32 before being replaced later with Rabin's Fingerprinting by random polynomials in an other patch:



Rabin's finger print technique was a big improvement since the sliding window allowed to detect short addition at the beginning of a zip and keep deltas tiny.



To finish with binary diff, here is the execution path of pack creation :

main (git.c) cmd_pack_objects (builtin/pack-objects.c) prepare_pack (builtin/pack-objects.c) ll_find_deltas (builtin/pack-objects.c) threaded_find_deltas (builtin/pack-objects.c) find_deltas (builtin/pack-objects.c) try_delta (builtin/pack-objects.c) create_delta (diff-delta.c)





Diff on command-line

git sometimes is not interested by the exact difference but rather by the distance between two strings. In 2008 was introduced a "suggestion" feature :

That is how the dispatcher seen in the architecture part is able to provide suggestions when a typo is suspected :

$git adt test.txt git: 'adt' is not a git command. See 'git --help'. Did you mean this? add

Word suggestion are generated by calculating an edit distance for each known builtin command using levenshtein distance . Found in levenshtein.c , it is a nice example of dynamic programming.

diffcore, xdiff, xdelta ... what ?

Just to summarize (because I was confused when I started) :

xdelta: A binary diff library which was considered but NOT used in Git.

xdiff: The text diff module of git. Originally based on libxdiff butthe current code in Git doesn't bear any resemblance with the libxdiff code at all. However the basic algorithm behind both implementations is the same..

butthe current code in Git doesn't bear any resemblance with the libxdiff code at all. However the basic algorithm behind both implementations is the same.. diffcore: A module of git that transforms the diff format after it has been generated.

Next

To be published: Git internal algorithms for graphs and merge operation.

@