by Neil Fraser, April 2006
Computing the differences between two sequences is at the core of many applications. Below is a simple example of the difference between two texts:
Text 1: Apples are a fruit.
Text 2: Bananas are also fruit.
Diff: AppleBananas are also fruit.
This paper surveys the literature on difference algorithms, compares them, and describes several techniques for improving the usability of these algorithms in practice. In particular, it discusses pre-processing optimisations, strategies for selecting the best difference algorithm for the job, and post-processing cleanup.
Even the best-known difference algorithms are computationally expensive processes. In most real-world instances, the two sequences (usually text) being compared are similar to each other to a certain extent. This observation enables several optimisations that can improve the actual running time of an algorithm, and in certain cases, that can even obviate the need for running the algorithm altogether.
The most obvious and the simplest optimisation is the equality test. Since there is a non-trivial chance that the two sequences are identical, and the test for this case is so trivial, it is logical to test for this case first. One side effect of this test is that it may simplify subsequent code. After this test, there is guaranteed to be a difference; the null case is eliminated.
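As a minimal sketch (in Python, with a hypothetical function name and an illustrative list-of-tuples output format), a difference routine might start with exactly this check:

    def diff(text1, text2):
        # Trivial case: identical inputs require no further work.
        if text1 == text2:
            return [('=', text1)] if text1 else []
        # At this point at least one difference is guaranteed to exist;
        # continue with prefix/suffix stripping and the main algorithm.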
If there is any commonality at all between the texts, it is likely that they will share a common substring at the start and/or the end.
Text 1: The cat in the hat.
Text 2: The dog in the hat.
This can be simplified down to:
Text 1: cat
Text 2: dog
Locating these common substrings can be done in O(log n) time using a binary search. Since binary searches are least efficient at their extreme points and it is not uncommon in the real world to have zero commonality, it makes sense to do a quick check of the first (or last) character before starting the search.
(This section generates a lot of email. The issue is that string equality operations (a == b) are typically O(n), thus the described algorithm would be O(n log n). However, when dealing with high-level languages, the speed difference between loops and equality operations is such that for all practical purposes the equality operation can be considered to be O(1). Further complicating the matter are languages like Python which use hash tables for all strings, thus making equality checking O(1) and string creation O(n). For more information, see the performance testing.)
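A sketch of the prefix check in Python might look like the following; common_prefix_length is a hypothetical helper, and an equivalent routine handles the common suffix by comparing from the ends of the strings:

    def common_prefix_length(text1, text2):
        # Binary search over the length of the common prefix, relying on
        # slice equality being effectively O(1) in practice (see above).
        lo, hi = 0, min(len(text1), len(text2))
        # Quick check: zero commonality is a frequent real-world case.
        if hi == 0 or text1[0] != text2[0]:
            return 0
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if text1[:mid] == text2[:mid]:
                lo = mid
            else:
                hi = mid - 1
        return lo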
The GNU diff program (which does linear matching for prefixes and suffixes) claims in its documentation that "occasionally [prefix & suffix stripping] may produce non-minimal output", though no example of this is provided.
A very common difference is the insertion or the deletion of some text:
Text 1: The cat in the hat. | Text 1: The cat in the hat.
Text 2: The furry cat in the hat. | Text 2: The cat.
After removing the common prefixes and suffixes one gets:
Text 1: | Text 1: in the hat
Text 2: furry | Text 2:
The presence of an empty 'Text 1' in the first example indicates that 'Text 2' is an insertion. The presence of an empty 'Text 2' in the second example indicates that 'Text 1' is a deletion. Detecting these common cases avoids the need to run a difference algorithm at all.
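Continuing the sketch, once the common prefix and suffix have been stripped these cases reduce to a test for an empty string (the function name and tuple format are again illustrative):

    def single_edit_diff(text1, text2):
        # Assumes the common prefix and suffix have already been removed.
        # Returns a diff if one text is a pure insertion or deletion of
        # the other, otherwise None (a real algorithm is still needed).
        if not text1 and not text2:
            return []
        if not text1:
            return [('+', text2)]    # insertion
        if not text2:
            return [('-', text1)]    # deletion
        return None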
Detecting and dealing with two edits is more challenging than dealing with a single edit. Two simple insertions can be detected by looking for the presence of 'Text 1' within 'Text 2'. Likewise, two simple deletions can be detected by looking for the presence of 'Text 2' in 'Text 1'.
Text 1: The cat in the hat.
Text 2: The happy cat in the black hat.
Removing the common prefixes and suffixes as a first step guarantees that there must be differences at each end of the remaining texts. It is then easy to determine that the shorter string ("cat in the") is present within the longer string ("happy cat in the black"). In these situations the difference may be determined without running a difference algorithm.
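This containment check is also cheap. A sketch, under the same assumptions and with illustrative naming:

    def two_edit_diff(text1, text2):
        # Assumes the common prefix and suffix have already been removed,
        # so a match of the shorter text inside the longer one implies
        # two simple insertions (or two simple deletions).
        if len(text1) > len(text2):
            longer, shorter, op = text1, text2, '-'    # two deletions
        else:
            longer, shorter, op = text2, text1, '+'    # two insertions
        i = longer.find(shorter)
        if not shorter or i == -1:
            return None
        return [(op, longer[:i]),
                ('=', shorter),
                (op, longer[i + len(shorter):])]

    # two_edit_diff("cat in the", "happy cat in the black")
    # -> [('+', 'happy '), ('=', 'cat in the'), ('+', ' black')]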
The situation is more complicated if the edits aren't two simple insertions or two simple deletions. These cases may often be detected if the two edits are separated by considerable text:
Text 1: The cat in the hat.
Text 2: The ox in the box.
After removing the common prefixes and suffixes one gets:
Text 1: cat in the hat
Text 2: ox in the box
If a substring exists in both texts which is at least half the length of the longer text, then it is guaranteed to be common. In this case the texts can be split in two, and separate differences carried out:
Text 1: cat | Text 1: hat
Text 2: ox | Text 2: box
Performing this test recursively may, in general, yield further subdivisions, although there are no such subdivisions in the above example.
Computing the longest common substring is an operation about as complex as computing the difference, which would mean there would be no savings. However, the limitation that the common substring must be at least half the length of the longer text provides a shortcut. As the diagram below illustrates, if a common substring of such a length exists, then the second quarter and/or third quarter of the longer text must form part of this substring.
[Diagram: the longer text divided into quarters; any common substring at least half the total length must include part of the second and/or third quarter.]
The smaller text can be searched for matches of these two quarters, and the context of any matches can be compared in both texts by looking for common prefixes and suffixes. The strings may be split at the location of the longest match which is equal to or greater than half the length of the longer text. Due to the problem of repeated strings, all matches of each quarter in the smaller text must be checked, not just the first one which reaches the necessary length.
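The following Python sketch implements this 'half match' shortcut under the assumptions above; the function name and return format are illustrative:

    def half_match(text1, text2):
        # If the two texts share a substring at least half the length of
        # the longer text, split both texts around it so the two halves
        # can be diffed separately.  Only the second and third quarters
        # of the longer text need to be tried as seeds.  Returns
        # (prefix1, suffix1, prefix2, suffix2, common) or None.
        longtext, shorttext = (text1, text2) if len(text1) > len(text2) else (text2, text1)
        if len(longtext) < 4 or len(shorttext) * 2 < len(longtext):
            return None                      # no half-length match can exist

        def common_prefix(a, b):
            n = 0
            while n < len(a) and n < len(b) and a[n] == b[n]:
                n += 1
            return n

        def common_suffix(a, b):
            n = 0
            while n < len(a) and n < len(b) and a[-n - 1] == b[-n - 1]:
                n += 1
            return n

        def best_match_from(i):
            # Seed with the quarter-length substring starting at index i,
            # grow every occurrence outwards, and keep the longest match.
            seed = longtext[i:i + len(longtext) // 4]
            best = None
            j = shorttext.find(seed)
            while j != -1:
                p = common_prefix(longtext[i:], shorttext[j:])
                s = common_suffix(longtext[:i], shorttext[:j])
                if best is None or len(best[4]) < s + p:
                    best = (longtext[:i - s], longtext[i + p:],
                            shorttext[:j - s], shorttext[j + p:],
                            shorttext[j - s:j + p])
                j = shorttext.find(seed, j + 1)
            return best

        candidates = [m for m in (best_match_from(len(longtext) // 4),
                                  best_match_from(len(longtext) // 2))
                      if m is not None]
        if not candidates:
            return None
        best = max(candidates, key=lambda m: len(m[4]))
        if len(best[4]) * 2 < len(longtext):
            return None                      # longest match is too short
        if len(text1) > len(text2):
            return best
        # Swap the pieces back into (text1, text2) order.
        return (best[2], best[3], best[0], best[1], best[4])

    # half_match("cat in the hat", "ox in the box")
    # -> ('cat', 'hat', 'ox', 'box', ' in the ')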
Once the pre-processing optimisation is complete, the remaining text is compared with a difference algorithm. A brute-force technique would take O(n1*n2) to execute (where n1 and n2 are the lengths of each input). Since this is clearly unscalable in practical applications where the text lengths are arbitrary, much research has been conducted on better algorithms which approach O(n1+n2). However, these algorithms are not interchangeable. There are several criteria beyond speed which are important.
There are three common modes for comparing input texts:
Text 1: The cat in the hat.
Text 2: The bird in the hand.
Char diff: The catbird in the hatnd.
Word diff: The catbird in the hathand.
Line diff: The cat in the hat.The bird in the hand.
Comparing by individual characters produces the finest level of detail but takes the longest to execute due to the larger number of tokens. Comparing by word boundaries or line breaks is faster and produces fewer individual edits, but the total length of the edits is larger. The required level of detail varies depending on the application. For instance, comparing source code is generally done on a line-by-line basis, comparing an English document is generally done on a word-by-word basis, and comparing binary data or DNA sequences is generally done on a character-by-character basis.
Any difference algorithm could theoretically process any input, regardless of whether it is split by characters, words or lines. However, some difference algorithms are much more efficient at handling small tokens such as characters, while others are much more efficient at handling large tokens such as lines. The reason is that there are an infinite number of possible lines, and any line which appears in one text but not the other is automatically known to be an insertion or a deletion. Conversely, there are only 80 or so distinct tokens when processing characters (a-z, A-Z, 0-9 and some punctuation), which means that any non-trivial text will contain multiple instances of most if not all of these characters. Different algorithms can exploit these statistical differences in the input texts, resulting in more efficient strategies. An algorithm which is specifically designed for line-by-line differences is described in J. Hunt and M. McIlroy's 1976 paper: An Algorithm for Differential File Comparison.
Another factor to consider is the availability of useful functions. Most computer languages have superior string handling facilities (such as regular expressions) when compared with their array handling facilities. These more powerful string functions may make character-based difference algorithms easier to program. On the other hand, the advent of Unicode support in many languages means that strings may use an alphabet as large as 65,536 characters. This allows words or lines to be hashed down to a single character so that the difference algorithm can make use of strings instead of arrays. To put this in perspective, the King James Bible contains 30,833 unique lines and 28,880 unique 'words' (just space-delimited, with leading or trailing punctuation not separated).
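A sketch of this hashing step in Python (the names are illustrative; a real implementation must also cope with inputs containing more unique lines than available code points):

    def lines_to_chars(text1, text2):
        # Hash each unique line down to a single character so that a
        # character-oriented difference algorithm (and fast string
        # operations) can be used on line-based input.
        line_codes = {}                  # line -> single-character code

        def encode(text):
            encoded = []
            for line in text.splitlines(keepends=True):
                if line not in line_codes:
                    line_codes[line] = chr(len(line_codes) + 1)
                encoded.append(line_codes[line])
            return ''.join(encoded)

        chars1, chars2 = encode(text1), encode(text2)
        # Reverse table for re-expanding the diff afterwards.
        decode_table = {code: line for line, code in line_codes.items()}
        return chars1, chars2, decode_table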
Traditional difference algorithms produce a list of insertions and deletions which when performed on the first text will result in the second text. An extension to this is the addition of a 'move' operator:
Text 1: The black cat in the hat?
Text 2: The cat in the black hat!
Ins & Del: The black cat in the black hat?!
...& Move: The ^cat in the black hat?!
When a large block of text has moved from one location to another, it is often more understandable to report this as a move rather than a deletion and an insertion. An algorithm which uses the 'move' operator is described in P. Heckel's 1978 paper: A technique for isolating differences between files.
An entirely different approach is to use 'copy' and 'insert' as operators:
Text 1: The black cat in the hat?
Text 2: The black hat on the black cat!
Copy & Ins: The black hat on the black cat!
This approach uses fragments from the first text, which are copied and pasted to form the second text. It is much like clipping out words from a newspaper to compose a ransom note, except that any clipped word may be photocopied and used multiple times. Any entirely new text is inserted verbatim. Copy/insert differences are generally not human-readable. However, they are significantly faster to compute, making them superior to insert/delete differences for delta compression applications. An algorithm which uses the 'copy' and 'insert' operators is described in J. MacDonald's 2000 paper: File System Support for Delta Compression.
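A copy/insert delta might be represented and applied along the following lines; the instruction encoding here is purely illustrative and is not the format used by any particular delta-compression tool:

    def apply_delta(source, delta):
        # Reconstruct the second text from the first using a copy/insert
        # delta.  Each instruction either copies a span of the source or
        # inserts literal new text.
        pieces = []
        for instruction in delta:
            if instruction[0] == 'copy':
                _, start, length = instruction
                pieces.append(source[start:start + length])
            else:                        # ('insert', text)
                pieces.append(instruction[1])
        return ''.join(pieces)

    # The example above, expressed as copies from the first text plus
    # two short insertions:
    delta = [('copy', 0, 10), ('copy', 21, 3), ('insert', ' on'),
             ('copy', 16, 4), ('copy', 3, 7), ('copy', 10, 3), ('insert', '!')]
    assert apply_delta("The black cat in the hat?",
                       delta) == "The black hat on the black cat!"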
No difference algorithm should ever return an incorrect output; that is, an output which does not describe a valid path of differences from one text to another. However, some algorithms may return sub-optimal outputs in the interests of speed. For instance, Heckel's algorithm (1978) is quick, but gets confused if repeated text exists in the inputs:
Text 1: A X X X X B
Text 2: C X X X X D
Optimal: AC X X X X BD
Heckel: A X X X X BC X X X X D
Another example of sacrificing accuracy for speed is to process the whole texts with a line-based algorithm, then reprocess each run of modified lines with a character-based algorithm. The problem with this multi-pass approach is that the line-based difference may sometimes identify inappropriate commonalities between the two lines. Blank lines are a common cause of these since they may appear in two unrelated texts. These inappropriate commonalities serve to randomly split up edit blocks and prevent genuinely common text from being discovered during the character-based phase. A solution to this is to pass the line-based differences through a semantic cleanup algorithm (as described below in section 3.2) before performing the character-based differences. In cases involving multiple edits throughout a long document, performing a high-level difference followed by a low-level difference can result in an order of magnitude improvement in speed and memory requirements. However, there remains a risk that the resulting difference path may not be the shortest one possible.
Arguably the best general-purpose difference algorithm is described in E. Myers' 1986 paper: An O(ND) Difference Algorithm and Its Variations. One of the proposed optimisations is to process the difference from both ends simultaneously, meeting in the middle. In most cases this improves performance by up to 50%.
A perfect difference algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect:
Text 1: I am the very model of a modern major general.
Text 2: `Twas brillig, and the slithy toves did gyre and gimble in the wabe.
Diff: I`Twas brillig, amnd the verslithy mtodvels ofdid a modgyrern majornd gimble in ther walbe.
The first step when dealing with a new diff is to transpose and merge like sections. In the above example one such optimisation is possible.
Old: I`Twas brillig, amnd ...
New: I`Twas brillig, amnd ...
Both diffs are identical in their output, but the second one has merged two operations into one by transposing a coincidentally repeated equality.
Transposition helps a little bit and is completely safe, but the larger problem is that differences between two dissimilar texts are frequently littered with small coincidental equalities called 'chaff'. The expected result above might be to delete all of 'Text 1' and insert all of 'Text 2', with the possible exception of the period at the end. However most algorithms will salvage bits and pieces, resulting in a mess.
This problem is most apparent in character-based differences since the small set of alphanumeric characters ensures commonalities. A word-based difference of the above example would be distinctly better, but would have inappropriately salvaged " the ". Longer texts would result in more shared words. A line-based difference of the above example would be ideal. However, even line-based differences are vulnerable to inappropriately salvaging blank lines and other common lines (such as "} else {" in source code).
The problem of chaff is actually one of two different problems: efficiency or semantics. Each of these problems requires a different solution.
If the output of the difference is designed for computer use (such as delta compression or input to a patch program) then depending on the subsequent application or storage method, each edit operation may have some fixed computational overhead associated with it in addition to the number of characters within that edit. For instance, fifty single-character edits might take more storage or take longer for the next application to process than a single fifty-character edit.
Once the trade-off has been measured, the computational or storage cost of an edit operation may be stated in terms of the equivalent cost in characters of change. If this cost is zero, then there is no overhead. If this cost is (for example) ten characters, then increasing the total number of characters edited by up to nine, while reducing the number of edit operations by one, would result in a net saving. Thus the total cost of a difference can be computed as o * c + n, where o is the number of edit operations, c is the constant cost of each edit operation in terms of characters, and n is the total number of characters changed. Below are three examples (with c set arbitrarily at 4) showing how increasing the number of edited characters can reduce the number of edit operations and reduce the overall cost of the difference.
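Expressed in code, the cost model is a one-liner; the figures below are illustrative only:

    def diff_cost(operations, changed_chars, c=4):
        # Total cost: per-operation overhead plus characters changed.
        return operations * c + changed_chars

    # Four small edits totalling 8 changed characters cost more than a
    # single pair of edits covering 14 characters once c = 4:
    assert diff_cost(4, 8) == 24
    assert diff_cost(2, 14) == 22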
First, any equality (text which remains unchanged) which is surrounded on both sides by an existing insertion and deletion need be less than c characters long for it to be advantageous to split it.
[Table: the diff of Text 1 "ABXYZCD" against a second text, listing the number of operations, characters changed, and total cost (with c = 4) before and after splitting the short equality; the split-and-merged version edits more characters but costs less overall.]
Secondly, any equality which is surrounded on one side by an existing insertion and deletion, and on the other side by an existing insertion or deletion, need be less than half of c characters long for it to be advantageous to split it.
[Table: the same comparison for Text 1 "XCD", where the single-character equality is bordered by an insertion and deletion on one side and a single edit on the other; splitting it again lowers the total cost.]
Both of these conditions may be computed quickly by making a single pass through the data, backtracking to reevaluate the previous equality if a split has changed the type of edits surrounding it. Another pass is made to reorder the edit operations and merge like ones together.
Although this is a good start, it is not a complete solution since it does not catch a third type of condition:
[Table: the same comparison for Text 1 "ABCD", where each individual split would raise the cost but performing all of the splits together lowers it.]
In this and similar cases, each individual split would result in a higher total cost, yet these splits, when combined, result in a lower total cost. Computing this form of optimisation appears to be an O(n²) operation on selected regions of the difference (as opposed to the O(n) optimisation for the first two cases), thus it may be more costly than the savings themselves.
If the output of the difference is designed for human use (such as a visual display), the problem changes. In this case the goal is to provide more meaningful divisions. Consider these two examples:
Text 1: Quicq fyre | Text 1: Slow fool
Text 2: Quick fire | Text 2: Quick fire
Diff: Quicqk fyire | Diff: SlowQuick foolire
Split: Quicqk f fyire | Split: SlowQuick f foolire
Merge: Quicq fyk fire | Merge: Slow foolQuick fire
Mathematically, these examples are very similar. They have the same central equality (" f") and they have the same number of edit operations. Yet the first example (which involves correcting two typographical errors) is more meaningful in its raw diff stage, rather than after splitting and merging the equality. Whereas the second example (which involves larger edits) has little meaning at its raw diff stage, and is much clearer after splitting and merging the equality. The primary distinction between these two examples is the amount of change surrounding the equality.
One solution for removing semantic chaff is to pass over the data looking for equalities that are smaller than or equal to the insertions and deletions on both sides of them. When such an equality is found, it is split into a deletion and an addition. Then a second pass is made to reorder and merge all deletions and additions which aren't separated by surviving equalities. Below is a somewhat contrived example showing these steps:
Text 1: Hovering
Text 2: My government
Diff: HMy goveringment
Split 1: HMy goverinngment
Split 2: HMy goveroverinngment
Merge: HoveringMy government
In this case "over" is four letters long, compared with only five and one letters of changes surrounding it, so it is left. However, "n" is only one letter, compared with one and five letters of changes surrounding it. Therefore "n" is split. Once an equality is split, the pass must backtrack to reevaluate the previous equality since its context has changed. In this case "over" is now surrounded by five and eight letters of changes, so it too is split. Finally all the pieces are collected together, resulting in an easily understandable difference.
This solution is not perfect. It has tunnel vision; it is unable to see beyond the immediate neighbourhood of each equality it evaluates. This can result in small groups of chaff surviving:
Text 1: It was a dark and stormy night.
Text 2: The black can in the cupboard.
Diff: ItThe wblas a darck cand sin the cupboarmy nightd.
Split: ItThe wblaas a darck cand sin tthe cupbooarrmy nightd.
Merge: It was a darThe black cand stormy night in the cupboard.
A more comprehensive solution might compute a weighted average of differences further away from the equality in question.
A separate issue with creating semantically meaningful diffs is aligning edit boundaries to logical divisions. Consider the following diffs:
Text 1: That cartoon.
Text 2: That cat cartoon.
(In each diff below, the inserted text is shown in brackets.)
Diff 1: Th[at c]at cartoon.
Diff 2: Tha[t ca]t cartoon.
Diff 3: That[ cat] cartoon.
Diff 4: That [cat ]cartoon.
Diff 5: That c[at c]artoon.
Diff 6: That ca[t ca]rtoon.
All six diffs are valid and minimal. Diffs 1 and 6 are the ones most likely to be returned by diff algorithms. But diffs 3 and 4 are more likely to capture the semantic meaning of the diff.
The solution is to locate each insertion or deletion which is surrounded on both sides by equalities, and attempt to slide it sideways. If the last token of the preceding equality equals the last token of the edit, then the edit may be slid left. Likewise, if the first token of the edit equals the first token of the following equality, then the edit may be slid right. Each of the possible locations can be scored based on whether the boundaries appear to be logical. One scheme which works is sketched below.
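As an illustration, the sketch below scores a candidate placement by examining the characters on either side of its two boundaries, awarding a point for a non-alphanumeric character and more for whitespace or a line break; the function names and exact weights are assumptions:

    def boundary_score(char):
        # Higher scores for characters that mark logical divisions.
        # The exact weights here are illustrative.
        if char.isalnum():
            return 0
        if char == '\n':
            return 3
        if char.isspace():
            return 2
        return 1          # punctuation and other non-alphanumerics

    def placement_score(prefix, edit, suffix):
        # Score one candidate placement of an edit between two equalities
        # by examining the characters touching each boundary.
        score = 0
        if prefix and edit:
            score += max(boundary_score(prefix[-1]), boundary_score(edit[0]))
        if edit and suffix:
            score += max(boundary_score(edit[-1]), boundary_score(suffix[0]))
        return score

    # Diff 4: placement_score("That ", "cat ", "cartoon.")  -> 4
    # Diff 1: placement_score("Th", "at c", "at cartoon.")  -> 0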
This scheme would give scores of zero to diffs 1, 2, 5 and 6, while giving scores of four to diffs 3 and 4.
See an implementation and online demonstration of diff.
See also the companion paper on diff's counterpart: Patch