One way to achieve higher precision is to first capture as much equality as possible. For whatever is left over, prioritize unique items and report them as differences. Finally, determine the next point of equality or uniqueness; everything in between is either a change, an insertion, or a deletion.
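
Under one plausible reading of that ordering, here is a minimal sketch in TypeScript. It treats lines that occur exactly once in both samples as anchors of equality and uniqueness, and classifies everything between anchors as a change, an insertion, or a deletion. The names and structure are illustrative assumptions, not Pretty Diff's actual code.

```typescript
type Segment =
  | { kind: "equal"; lines: string[] }
  | { kind: "change"; removed: string[]; added: string[] }
  | { kind: "insert"; added: string[] }
  | { kind: "delete"; removed: string[] };

// Count how often each line appears in a sample.
function countLines(sample: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of sample) {
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  return counts;
}

function diffSketch(a: string[], b: string[]): Segment[] {
  const countA = countLines(a);
  const countB = countLines(b);
  // A line occurring exactly once in both samples is a reliable anchor.
  const isAnchor = (line: string): boolean =>
    countA.get(line) === 1 && countB.get(line) === 1;

  const out: Segment[] = [];
  const pushGap = (removed: string[], added: string[]): void => {
    if (removed.length > 0 && added.length > 0) {
      out.push({ kind: "change", removed, added });
    } else if (added.length > 0) {
      out.push({ kind: "insert", added });
    } else if (removed.length > 0) {
      out.push({ kind: "delete", removed });
    }
  };

  let i = 0;
  let j = 0;
  while (i < a.length || j < b.length) {
    // 1. Capture as much equality as possible.
    const equal: string[] = [];
    while (i < a.length && j < b.length && a[i] === b[j]) {
      equal.push(a[i]);
      i += 1;
      j += 1;
    }
    if (equal.length > 0) {
      out.push({ kind: "equal", lines: equal });
    }
    if (i >= a.length && j >= b.length) {
      break;
    }
    // 2. Find the next anchor (unique in both samples) still ahead of us.
    let nextI = a.length;
    let nextJ = b.length;
    for (let k = i; k < a.length; k += 1) {
      if (isAnchor(a[k])) {
        const m = b.indexOf(a[k], j);
        if (m !== -1) {
          nextI = k;
          nextJ = m;
          break;
        }
      }
    }
    // 3. Everything between here and that anchor is a change, insertion,
    //    or deletion, depending on which side has leftover lines.
    pushGap(a.slice(i, nextI), b.slice(j, nextJ));
    i = nextI;
    j = nextJ;
  }
  return out;
}
```

For example, diffSketch(["a", "b", "c"], ["a", "x", "c"]) yields an equal segment for "a", a change of "b" to "x", and an equal segment for "c".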

A false negative, allowing a difference through without detection, is really bad. It is an absolute failure in a diff algorithm. A false positive, reporting more differences than actually exist, is also bad, but a false positive is much better than a false negative. This means it is safer to report more differences than are actually present, which is why a higher number of reported differences means lower precision. Maximum precision is reporting differences without any false negatives or false positives.
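
To make the terminology concrete, consider a tiny hypothetical example (the sample values below are illustrative only):

```typescript
const before = ["a", "b", "c"];
const after  = ["a", "x", "c"];

// Maximum precision: report exactly one difference, line 2 changed from "b" to "x".
// False negative:    report no differences at all; the change to line 2 slips
//                    through undetected, which is an absolute failure.
// False positive:    report that lines 2 and 3 both changed; line 3 ("c") is
//                    actually identical, so an extra difference is reported.
```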

The primary priorities when writing this kind of code are execution speed, precision (as previously described), and code simplicity. In most cases precision is the most important factor in code design, but in some cases speed matters more, such as when processing large input or a batch of thousands of files. Simplicity is necessary so that other people can understand the code and modify it as needed to improve the design and add features. No algorithm is ever capable of meeting all needs for all use cases, so it is important that other people can understand the code with minimal effort.

Speed

Faster execution is the result of a couple of things. The most important consideration for making an algorithm faster is to reduce the number of passes through the data where possible. After that, eliminate all costly operations. In most programming languages simple arithmetic and static string comparisons are cheap to execute, particularly if the examined strings aren't changing. Pretty Diff achieves speed in its algorithm by making only three complete passes through the data and by taking every possible effort to never repeat a step or loop iteration. The Pretty Diff approach is linear and predictable, where the total number of iterations through the data is computed as: iterations over the first sample + iterations over the second sample + iterations over the smaller of those two samples. Performance of this approach depends heavily upon key access in a hash map, which will likely vary by programming language.
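
Expressed as code, that iteration count looks roughly like the following (the function name is illustrative, not part of Pretty Diff):

```typescript
// Total iterations for the three complete, linear passes described above:
// one over each sample, plus one over the smaller of the two.
function totalIterations(sampleA: string[], sampleB: string[]): number {
  return sampleA.length + sampleB.length + Math.min(sampleA.length, sampleB.length);
}

// For samples of 1,000 and 600 lines this predicts 1000 + 600 + 600 = 2,200 iterations.
```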

The theoretical minimum number of data passes is two, as you have to read the contents of each submitted sample. Until we discover a way to perform a comparison without reading the samples, we can safely assume two data passes is the fastest possible approach. Between that minimum and the Pretty Diff approach there are two possibilities for increased performance. The first possibility is to make decisions and write output immediately upon reading from the second sample, so that only two data passes occur. The challenge with this approach is that analysis happens in the moment of the current loop iteration, without knowledge of what comes later in the second sample. A more practical second possibility is to write a smaller hash map. Writing a smaller hash map means moving some of the decision tree up front, before a separate formal analysis phase. For this approach to be viable the early step in logic must be tiny and non-replicating, and a different means of iteration must be devised to account for a data store of unpredictable size.
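
A sketch of that second possibility might look like the following. It makes whatever decisions it can while reading the second sample and keeps only the ambiguous lines in a smaller store for later analysis; the names and structure are assumptions for illustration, not Pretty Diff's implementation.

```typescript
// Hypothetical "smaller hash map" sketch: decide what can be decided up
// front, and defer only the ambiguous lines to a later analysis phase.
function earlyDecisions(a: string[], b: string[]) {
  // Pass 1: count every line of the first sample.
  const counts = new Map<string, number>();
  for (const line of a) counts.set(line, (counts.get(line) ?? 0) + 1);

  const inserted: string[] = [];  // decided immediately: only in the second sample
  const ambiguous: string[] = []; // matched, but ordering still unknown
  // Pass 2: read the second sample and make early decisions in the moment.
  for (const line of b) {
    const remaining = counts.get(line) ?? 0;
    if (remaining === 0) {
      inserted.push(line);
    } else {
      counts.set(line, remaining - 1);
      ambiguous.push(line);
    }
  }
  // Any count left over was never matched: decided immediately as deleted.
  const deleted: string[] = [];
  for (const [line, remaining] of counts) {
    for (let k = 0; k < remaining; k += 1) deleted.push(line);
  }
  return { inserted, deleted, ambiguous };
}
```

The ambiguous store here is the data store of unpredictable size mentioned above: its length depends on how much the two samples overlap, so the later, smaller pass cannot be sized in advance.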

This page blew up on Hacker News recently, and many comments suggested this approach could not possibly be faster than Myers' O(ND) approach. That may or may not be true, and no evidence was provided either way (conjecture is not evidence).

In terms of experimental criteria the algorithms themselves are largely irrelevant. More important is the implementation of those algorithms. If exercising the Myers approach makes fewer decisions and has fewer total data passes, as described in the previous paragraph, then it likely is faster. I am calling out the word total because this makes all the difference. Many of the diff applications I looked at do not perform a complete third data pass. Instead they perform the minimum two complete data passes over the samples plus various smaller passes as calculated by block moves and edit distances. If these fractional passes are non-linear, meaning they start and stop in different places, their performance is less predictable. If these fractional passes are non-linear and touch any given data index more than once they contain repetitive logic, and are likely not as fast. To affirmatively guarantee superior performance over the Pretty Diff approach there must be fewer passes over the data, which means no repetition and a smaller number of iterations. Predictability ensures the performance of an application scales roughly proportionally to the size of the provided samples, where an unpredictable approach would scale disproportionately (slower over time). I say roughly because things in physical reality always mess this up: solar flares, memory block limitations, CPU heat, and so forth.

I believe the approach taken here is fast. I honestly cannot say, scientifically, that it is the fastest ever (or slowest) approach for its level of accuracy without also writing the alternative algorithms into applications with identical constraints. Neither can anyone else. I can safely say this approach is the fastest ever comparative algorithm for its level of predictability and precision.