Diff Strategies

by Neil Fraser, April 2006

Computing the differences between two sequences is at the core of many applications. Below is a simple example of the difference between two texts:

Text 1: Apples are a fruit.

Text 2: Bananas are also fruit.

Diff:   AppleBananas are also fruit.

This paper surveys the literature on difference algorithms, compares them, and describes several techniques for improving the usability of these algorithms in practice. In particular, it discusses pre-processing optimisations, strategies for selecting the best difference algorithm for the job, and post-processing cleanup.

1 Pre-processing Optimisations

Even the best-known difference algorithms are computationally expensive processes. In most real-world instances, the two sequences (usually text) being compared are similar to each other to a certain extent. This observation enables several optimisations that can improve the actual running time of an algorithm, and in certain cases, that can even obviate the need for running the algorithm altogether.

1.1 Equality

The most obvious and the simplest optimisation is the equality test. Since there is a non-trivial chance that the two sequences are identical, and the test for this case is so trivial, it is logical to test for this case first. One side effect of this test is that it may simplify subsequent code. After this test, there is guaranteed to be a difference; the null case is eliminated.

Sample Code

1.2 Common Prefix/Suffix

If there is any commonality at all between the texts, it is likely that they will share a common substring at the start and/or the end.

Text 1: The cat in the hat.

Text 2: The dog in the hat.

This can be simplified down to:

Text 1: cat

Text 2: dog

Locating these common substrings can be done in O(log n) using a binary search. Since binary searches are least efficient at their extreme points and it is not uncommon in the real-world to have zero commonality, it makes sense to do a quick check of the first (or last) character before starting the search.

(This section generates a lot of email. The issue is that string equality operations (a == b) are typically O(n), thus the described algorithm would be O(n log n). However, when dealing with high level languages, the speed difference between loops and equality operations is such that for all practical purposes the equality opeation can be considered to be O(1). Further complicating the matter are languages like Python which use hash tables for all strings, thus making equality checking be O(1) and string creation be O(n). For more information, see the performance testing.)

Sample Code

The GNU diff program (which does linear matching for prefixes and suffixes) claims in their documentation that "occasionally [prefix & suffix stripping] may produce non-minimal output", though they do not provide an example of this.

1.3 Singular Insertion/Deletion

A very common difference is the insertion or the deletion of some text:

Text 1: The cat in the hat.             | Text 1: The cat in the hat.

Text 2: The furry cat in the hat.       | Text 2: The cat.

After removing the common prefixes and suffixes one gets:

Text 1:                                 | Text 1:  in the hat

Text 2: furry                           |  Text 2:

The presence of an empty 'Text 1' in the first example indicates that 'Text 2' is an insertion. The presence of an empty 'Text 2' in the second example indicates that 'Text 1' is a deletion. Detecting these common cases avoids the need to run a difference algorithm at all.

Sample Code

1.4 Two Edits

Detecting and dealing with two edits is more challenging than singular edits. Two simple insertions can be detected by looking for the presence of 'Text 1' within 'Text 2'. Likewise two simple deletions can be detected by looking for the presence of 'Text 2' in 'Text 1'.


Text 1: The cat in the hat.

Text 2: The happy cat in the black hat.

Removing the common prefixes and suffixes as a first step guarantees that there must be differences at each end of the remaining texts. It is then easy to determine that the shorter string ("cat in the") is present within the longer string ("happy cat in the black"). In these situations the difference may be determined without running a difference algorithm.

Sample Code

The situation is more complicated if the edits aren't two simple insertions or two simple deletions. These cases may often be detected if the two edits are separated by considerable text:

Text 1: The cat in the hat.

Text 2: The ox in the box.

After removing the common prefixes and suffixes one gets:

Text 1: cat in the hat

Text 2: ox in the box

If a substring exists in both texts which is at least half the length of the longer text, then it is guaranteed to be common. In this case the texts can be split in two, and separate differences carried out:

Text 1: cat     | Text 1: hat

Text 2: ox      | Text 2: box

Performing this test recursively may, in general, yield further subdivisions, although there are no such subdivisions in the above example.

Computing the longest common substring is an operation about as complex as computing the difference, which would mean there would be no savings. However, the limitation that the common substring must be at least half the length of the longer text provides a shortcut. As the diagram below illustrates, if a common substring of such a length exists, then the second quarter and/or third quarter of the longest string must form part of this substring.

0	¼	½	¾	1

				← Drag the 'half length' bar with your mouse.

The smaller text can be searched for matches of these two quarters, and the context of any matches can be compared in both texts by looking for common prefixes and suffixes. The strings may be split at the location of the longest match which is equal to or greater than the half the length of the longer text. Due to the problem of repeated strings, all matches of each quarter in the smaller text must be checked, not just the first one which reaches the necessary length.

Sample Code

// Check to see if the problem can be split in two.
var hm = diff_halfMatch(text1, text2);
if (hm) {
  // A half-match was found, sort out the return data.
  var text1_a = hm[0];
  var text1_b = hm[1];
  var text2_a = hm[2];
  var text2_b = hm[3];
  var mid_common = hm[4];
  // Send both pairs off for separate processing.
  var diffs_a = diff_main(text1_a, text2_a);
  var diffs_b = diff_main(text1_b, text2_b);
  // Merge the results.
  return diffs_a.concat([[DIFF_EQUAL, mid_common]], diffs_b);
}

function diff_halfMatch(text1, text2) {
  // Do the two texts share a substring which is at least half the length of the
  // longer text?
  var longtext = text1.length > text2.length ? text1 : text2;
  var shorttext = text1.length > text2.length ? text2 : text1;
  if (longtext.length < 10 || shorttext.length < 1) {
    return null;  // Pointless.
  }

// First check if the second quarter is the seed for a half-match.
  var hm1 = diff_halfMatchI(longtext, shorttext,
                             Math.ceil(longtext.length / 4));
  // Check again based on the third quarter.
  var hm2 = diff_halfMatchI(longtext, shorttext,
                             Math.ceil(longtext.length / 2));
  var hm;
  if (!hm1 && !hm2) {
    return null;
  } else if (!hm2) {
    hm = hm1;
  } else if (!hm1) {
    hm = hm2;
  } else {
    // Both matched.  Select the longest.
    hm = hm1[4].length > hm2[4].length ? hm1 : hm2;
  }

// A half-match was found, sort out the return data.
  var text1_a, text1_b, text2_a, text2_b;
  if (text1.length > text2.length) {
    text1_a = hm[0];
    text1_b = hm[1];
    text2_a = hm[2];
    text2_b = hm[3];
  } else {
    text2_a = hm[0];
    text2_b = hm[1];
    text1_a = hm[2];
    text1_b = hm[3];
  }
  var mid_common = hm[4];
  return [text1_a, text1_b, text2_a, text2_b, mid_common];
}

See earlier code for 'diff_commonPrefix' and 'diff_commonSuffix' functions.

// Check to see if the problem can be split in two.
String[] hm = diff_halfMatch(text1, text2);
if (hm != null) {
  // A half-match was found, sort out the return data.
  String text1_a = hm[0];
  String text1_b = hm[1];
  String text2_a = hm[2];
  String text2_b = hm[3];
  String mid_common = hm[4];
  // Send both pairs off for separate processing.
  LinkedList<Diff> diffs_a = diff_main(text1_a, text2_a);
  LinkedList<Diff> diffs_b = diff_main(text1_b, text2_b);
  // Merge the results.
  diffs = diffs_a;
  diffs.add(new Diff(Operation.EQUAL, mid_common));
  diffs.addAll(diffs_b);
  return diffs;
}

protected String[] diff_halfMatch(String text1, String text2) {
  // Do the two texts share a substring which is at least half the length of the
  // longer text?
  String longtext = text1.length() > text2.length() ? text1 : text2;
  String shorttext = text1.length() > text2.length() ? text2 : text1;
  if (longtext.length() < 10 || shorttext.length() < 1) {
    return null;  // Pointless.
  }

// First check if the second quarter is the seed for a half-match.
  String[] hm1 = diff_halfMatchI(longtext, shorttext,
                                 (longtext.length() + 3) / 4);
  // Check again based on the third quarter.
  String[] hm2 = diff_halfMatchI(longtext, shorttext,
                                 (longtext.length() + 1) / 2);
  String[] hm;
  if (hm1 == null && hm2 == null) {
    return null;
  } else if (hm2 == null) {
    hm = hm1;
  } else if (hm1 == null) {
    hm = hm2;
  } else {
    // Both matched.  Select the longest.
    hm = hm1[4].length() > hm2[4].length() ? hm1 : hm2;
  }

// A half-match was found, sort out the return data.
  if (text1.length() > text2.length()) {
    return hm;
    //return new String[]{hm[0], hm[1], hm[2], hm[3], hm[4]};
  } else {
    return new String[]{hm[2], hm[3], hm[0], hm[1], hm[4]};
  }
}

private String[] diff_halfMatchI(String longtext, String shorttext, int i) {
  // Start with a 1/4 length substring at position i as a seed.
  String seed = longtext.substring(i, i + longtext.length() / 4);
  int j = -1;
  String best_common = "";
  String best_longtext_a = "", best_longtext_b = "";
  String best_shorttext_a = "", best_shorttext_b = "";
  while ((j = shorttext.indexOf(seed, j + 1)) != -1) {
    int prefixLength = diff_commonPrefix(longtext.substring(i),
                                         shorttext.substring(j));
    int suffixLength = diff_commonSuffix(longtext.substring(0, i),
                                         shorttext.substring(0, j));
    if (best_common.length() < suffixLength + prefixLength) {
      best_common = shorttext.substring(j - suffixLength, j)
          + shorttext.substring(j, j + prefixLength);
      best_longtext_a = longtext.substring(0, i - suffixLength);
      best_longtext_b = longtext.substring(i + prefixLength);
      best_shorttext_a = shorttext.substring(0, j - suffixLength);
      best_shorttext_b = shorttext.substring(j + prefixLength);
    }
  }
  if (best_common.length() >= longtext.length() / 2) {
    return new String[]{best_longtext_a, best_longtext_b,
                        best_shorttext_a, best_shorttext_b, best_common};
  } else {
    return null;
  }
}

See earlier code for 'diff_commonPrefix' and 'diff_commonSuffix' functions.

# Check to see if the problem can be split in two.
hm = diff_halfMatch(text1, text2)
if hm:
  # A half-match was found, sort out the return data.
  (text1_a, text1_b, text2_a, text2_b, mid_common) = hm
  # Send both pairs off for separate processing.
  diffs_a = diff_main(text1_a, text2_a)
  diffs_b = diff_main(text1_b, text2_b)
  # Merge the results.
  return diffs_a + [(DIFF_EQUAL, mid_common)] + diffs_b

def diff_halfMatch(text1, text2):
  # Do the two texts share a substring which is at least half the length of the
  # longer text?
  if len(text1) > len(text2):
    (longtext, shorttext) = (text1, text2)
  else:
    (shorttext, longtext) = (text1, text2)
  if len(longtext) < 10 or len(shorttext) < 1:
    return None  # Pointless.

def diff_halfMatchI(longtext, shorttext, i):
    seed = longtext[i:i + len(longtext) / 4]
    best_common = ''
    j = shorttext.find(seed)
    while j != -1:
      prefixLength = diff_commonPrefix(longtext[i:], shorttext[j:])
      suffixLength = diff_commonSuffix(longtext[:i], shorttext[:j])
      # print "%s|%s+%s|%s vs. %s|%s+%s|%s" %
      #     (my_suffix[0], my_suffix[2], my_prefix[2], my_prefix[0],
      #     my_suffix[1], my_suffix[2], my_prefix[2], my_prefix[1])
      if len(best_common) < suffixLength + prefixLength:
        best_common = (shorttext[j - suffixLength:j] +
            shorttext[j:j + prefixLength])
        best_longtext_a = longtext[:i - suffixLength]
        best_longtext_b = longtext[i + prefixLength:]
        best_shorttext_a = shorttext[:j - suffixLength]
        best_shorttext_b = shorttext[j + prefixLength:]
      j = shorttext.find(seed, j + 1)

if len(best_common) >= len(longtext) / 2:
      return (best_longtext_a, best_longtext_b,
              best_shorttext_a, best_shorttext_b, best_common)
    else:
      return None

# First check if the second quarter is the seed for a half-match.
  hm1 = diff_halfMatchI(longtext, shorttext, (len(longtext) + 3) / 4)
  # Check again based on the third quarter.
  hm2 = diff_halfMatchI(longtext, shorttext, (len(longtext) + 1) / 2)
  if not hm1 and not hm2:
    return None
  elif not hm2:
    hm = hm1
  elif not hm1:
    hm = hm2
  else:
    # Both matched.  Select the longest.
    if len(hm1[4]) > len(hm2[4]):
      hm = hm1
    else:
      hm = hm2

# A half-match was found, sort out the return data.
  if len(text1) > len(text2):
    (text1_a, text1_b, text2_a, text2_b, mid_common) = hm
  else:
    (text2_a, text2_b, text1_a, text1_b, mid_common) = hm
  return (text1_a, text1_b, text2_a, text2_b, mid_common)

See earlier code for 'diff_commonPrefix' and 'diff_commonSuffix' functions.

2 Difference Algorithms

Once the pre-processing optimisation is complete, the remaining text is compared with a difference algorithm. A brute-force technique would take O(n₁*n₂) to execute (where n₁ and n₂ are the lengths of each input). Since this is clearly unscalable in practical applications where the text lengths are arbitrary, much research has been conducted on better algorithms which approach O(n₁+n₂). However, these algorithms are not interchangeable. There are several criteria beyond speed which are important.

2.1 Input

There are three common modes for comparing input texts:

Text 1: The cat in the hat.

Text 2: The bird in the hand.

Char diff: The catbird in the hatnd.

Word diff: The catbird in the hathand.

Line diff: The cat in the hat.The bird in the hand.

Comparing by individual characters produces the finest level of detail but takes the longest to execute due to the larger number of tokens. Comparing by word boundaries or line breaks is faster and produces fewer individual edits, but the total length of the edits is larger. The required level of detail varies depending on the application. For instance comparing source code is generally done on a line-by-line basis, whereas comparing an English document is generally done on a word-by-word basis, and binary data or DNA sequences is generally done on a character-by-character basis.

Any difference algorithm could theoretically process any input, regardless of whether it is split by characters, words or lines. However, some difference algorithms are much more efficient at handling small tokens such as characters, others are much more efficient at handling large tokens such as lines. The reason is that there are an infinite number of possible lines, and any line which does not appear in one text but appears in the other is automatically known to be an insertion or a deletion. Conversely, there are only 80 or so distinct tokens when processing characters (a-z, A-Z, 0-9 and some punctuation), which means that any non-trivial text will contain multiple instances of most if not all these characters. Different algorithms can exploit these statistical differences in the input texts, resulting in more efficient strategies. An algorithm which is specifically designed for line-by-line differences is described in J. Hunt and M. McIlroy's 1976 paper: An Algorithm for Differential File Comparison.

Another factor to consider is the availability of useful functions. Most computer languages have superior string handling facilities (such as regular expressions) when compared with array handling facilities. These more powerful string functions may make character-based difference algorithms easier to program. On the other hand, the advent of Unicode support in many languages means that strings may contain alphabet sizes as great as 65,536. This allows words or lines to be hashed down to a single character so that the difference algorithm can make use of strings instead of arrays. To put this in perspective, the King James Bible contains 30,833 unique lines and 28,880 unique 'words' (just space-delimited, with leading or trailing punctuation not separated).

2.2 Output

Traditional difference algorithms produce a list of insertions and deletions which when performed on the first text will result in the second text. An extension to this is the addition of a 'move' operator:

Text 1: The black cat in the hat?

Text 2: The cat in the black hat!

Ins & Del: The black cat in the black hat?!

...& Move: The ^cat in the black hat?!

When a large block of text has moved from one location to another, it is often more understandable to report this as a move rather than a deletion and an insertion. An algorithm which uses the 'move' operator is described in P. Heckel's 1978 paper: A technique for isolating differences between files.

An entirely different approach is to use 'copy' and 'insert' as operators:


Text 1: The black cat in the hat?

Text 2: The black hat on the black cat!

Copy & Ins: The black hat on the black cat!

This approach uses fragments from the first text, which are copied and pasted to form the second text. Much like clipping out words from a newspaper to compose a ransom note, except that the any clipped word may be photocopied and used multiple times. Any entirely new text is inserted verbatim. Copy/insert differences are generally not human-readable. However they are significantly faster to compute making them superior than insert/delete differences for delta compression applications. An algorithm which uses the 'copy' and 'insert' operators is described in J. MacDonald's 2000 paper: File System Support for Delta Compression.

2.3 Accuracy

No difference algorithm should ever return an incorrect output; that is, an output which does not describe a valid path of differences from one text to another. However, some algorithms may return sub-optimal outputs in the interests of speed. For instance, Heckel's algorithm (1978) is quick, but gets confused if repeated text exists in the inputs:

Text 1:  A X X X X B

Text 2:  C X X X X D

Optimal: AC X X X X BD

Heckel:  A X X X X BC X X X X D

Another example of sacrificing accuracy for speed is to process the whole texts with a line-based algorithm, then reprocess each run of modified lines with a character-based algorithm. The problem with this multi-pass approach is that the line-based difference may sometimes identify inappropriate commonalities between the two lines. Blank lines are a common cause of these since they may appear in two unrelated texts. These inappropriate commonalities serve to randomly split up edit blocks and prevent genuinely common text from being discovered during the character-based phase. A solution to this is to pass the line-based differences through a semantic cleanup algorithm (as described below in section 3.2) before performing the character-based differences. In cases involving multiple edits throughout a long document, performing a high-level difference followed by a low-level difference can result in an order of magnitude improvement in speed and memory requirements. However, there remains a risk that the resulting difference path may not be the shortest one possible.

Arguably the best general-purpose difference algorithm is described in E. Myers' 1986 paper: An O(ND) Difference Algorithm and Its Variations. One of the proposed optimisations is to process the difference from both ends simultaneously, meeting at the middle. In most cases this improves the performance by up to 50%.

3 Post-processing Cleanup

A perfect difference algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect:

Text 1: I am the very model of a modern major general.

Text 2: `Twas brillig, and the slithy toves did gyre and gimble in the wabe.

Diff:   I`Twas brillig, amnd the verslithy mtodvels ofdid a modgyrern majornd gimble in ther walbe.

The first step when dealing with a new diff is to transpose and merge like sections. In the above example one such optimization is possible.


Old: I`Twas brillig, amnd ...

New: I`Twas brillig, amnd ...

Both diffs are identical in their output, but the second one has merged two operations into one by transposing a coincidentally repeated equality.

Sample Code

// Second pass: look for single edits surrounded on both sides by equalities
  // which can be shifted sideways to eliminate an equality.
  // e.g: A<ins>BA</ins>C -> <ins>AB</ins>AC
  var changes = false;
  pointer = 1;
  // Intentionally ignore the first and last element (don't need checking).
  while (pointer < diffs.length - 1) {
    if (diffs[pointer - 1][0] == DIFF_EQUAL &&
        diffs[pointer + 1][0] == DIFF_EQUAL) {
      // This is a single edit surrounded by equalities.
      if (diffs[pointer][1].substring(diffs[pointer][1].length -
          diffs[pointer - 1][1].length) == diffs[pointer - 1][1]) {
        // Shift the edit over the previous equality.
        diffs[pointer][1] = diffs[pointer - 1][1] +
            diffs[pointer][1].substring(0, diffs[pointer][1].length -
                                        diffs[pointer - 1][1].length);
        diffs[pointer + 1][1] = diffs[pointer - 1][1] + diffs[pointer + 1][1];
        diffs.splice(pointer - 1, 1);
        changes = true;
      } else if (diffs[pointer][1].substring(0, diffs[pointer + 1][1].length) ==
          diffs[pointer + 1][1]) {
        // Shift the edit over the next equality.
        diffs[pointer - 1][1] += diffs[pointer + 1][1];
        diffs[pointer][1] =
            diffs[pointer][1].substring(diffs[pointer + 1][1].length) +
            diffs[pointer + 1][1];
        diffs.splice(pointer + 1, 1);
        changes = true;
      }
    }
    pointer++;
  }
  // If shifts were made, the diff needs reordering and another shift sweep.
  if (changes) {
    diff_cleanupMerge(diffs);
  }
}

public void diff_cleanupMerge(LinkedList<Diff> diffs) {
  // Reorder and merge like edit sections.  Merge equalities.
  // Any edit section can move as long as it does not cross an equality.
  diffs.add(new Diff(Operation.EQUAL, ""));  // Add a dummy entry at the end.
  ListIterator<Diff> pointer = diffs.listIterator();
  int count_delete = 0;
  int count_insert = 0;
  String text_delete = "";
  String text_insert = "";
  Diff thisDiff = pointer.next();
  Diff prevEqual = null;
  int commonlength;
  while (thisDiff != null) {
    switch (thisDiff.operation) {
    case INSERT:
      count_insert++;
      text_insert += thisDiff.text;
      prevEqual = null;
      break;
    case DELETE:
      count_delete++;
      text_delete += thisDiff.text;
      prevEqual = null;
      break;
    case EQUAL:
      if (count_delete != 0 || count_insert != 0) {
        // Delete the offending records.
        pointer.previous();  // Reverse direction.
        while (count_delete-- > 0) {
          pointer.previous();
          pointer.remove();
        }
        while (count_insert-- > 0) {
          pointer.previous();
          pointer.remove();
        }
        if (count_delete != 0 && count_insert != 0) {
          // Factor out any common prefixies.
          commonlength = diff_commonPrefix(text_insert, text_delete);
          if (commonlength != 0) {
            if (pointer.hasPrevious()) {
              thisDiff = pointer.previous();
              assert thisDiff.operation == Operation.EQUAL
                     : "Previous diff should have been an equality.";
              thisDiff.text += text_insert.substring(0, commonlength);
              pointer.next();
            } else {
              pointer.add(new Diff(Operation.EQUAL,
                  text_insert.substring(0, commonlength)));
            }
            text_insert = text_insert.substring(commonlength);
            text_delete = text_delete.substring(commonlength);
          }
          // Factor out any common suffixies.
          commonlength = diff_commonSuffix(text_insert, text_delete);
          if (commonlength != 0) {
            thisDiff = pointer.next();
            thisDiff.text = text_insert.substring(text_insert.length()
                - commonlength) + thisDiff.text;
            text_insert = text_insert.substring(0, text_insert.length()
                - commonlength);
            text_delete = text_delete.substring(0, text_delete.length()
                - commonlength);
            pointer.previous();
          }
        }
        // Insert the merged records.
        if (text_delete.length() != 0) {
          pointer.add(new Diff(Operation.DELETE, text_delete));
        }
        if (text_insert.length() != 0) {
          pointer.add(new Diff(Operation.INSERT, text_insert));
        }
        // Step forward to the equality.
        thisDiff = pointer.hasNext() ? pointer.next() : null;
      } else if (prevEqual != null) {
        // Merge this equality with the previous one.
        prevEqual.text += thisDiff.text;
        pointer.remove();
        thisDiff = pointer.previous();
        pointer.next();  // Forward direction
      }
      count_insert = 0;
      count_delete = 0;
      text_delete = "";
      text_insert = "";
      prevEqual = thisDiff;
      break;
    }
    thisDiff = pointer.hasNext() ? pointer.next() : null;
  }
  // System.out.println(diff);
  if (diffs.getLast().text.length() == 0) {
    diffs.removeLast();  // Remove the dummy entry at the end.
  }

/*
   * Second pass: look for single edits surrounded on both sides by equalities
   * which can be shifted sideways to eliminate an equality.
   * e.g: A<ins>BA</ins>C -> <ins>AB</ins>AC
   */
  boolean changes = false;
  // Create a new iterator at the start.
  // (As opposed to walking the current one back.)
  pointer = diffs.listIterator();
  Diff prevDiff = pointer.hasNext() ? pointer.next() : null;
  thisDiff = pointer.hasNext() ? pointer.next() : null;
  Diff nextDiff = pointer.hasNext() ? pointer.next() : null;
  // Intentionally ignore the first and last element (don't need checking).
  while (nextDiff != null) {
    if (prevDiff.operation == Operation.EQUAL &&
        nextDiff.operation == Operation.EQUAL) {
      // This is a single edit surrounded by equalities.
      if (thisDiff.text.endsWith(prevDiff.text)) {
        // Shift the edit over the previous equality.
        thisDiff.text = prevDiff.text
            + thisDiff.text.substring(0, thisDiff.text.length()
                                         - prevDiff.text.length());
        nextDiff.text = prevDiff.text + nextDiff.text;
        pointer.previous(); // Walk past nextDiff.
        pointer.previous(); // Walk past thisDiff.
        pointer.previous(); // Walk past prevDiff.
        pointer.remove(); // Delete prevDiff.
        pointer.next(); // Walk past thisDiff.
        thisDiff = pointer.next(); // Walk past nextDiff.
        nextDiff = pointer.hasNext() ? pointer.next() : null;
        changes = true;
      } else if (thisDiff.text.startsWith(nextDiff.text)) {
        // Shift the edit over the next equality.
        prevDiff.text += nextDiff.text;
        thisDiff.text = thisDiff.text.substring(nextDiff.text.length()) +
            nextDiff.text;
        pointer.remove(); // Delete nextDiff.
        nextDiff = pointer.hasNext() ? pointer.next() : null;
        changes = true;
      }
    }
    prevDiff = thisDiff;
    thisDiff = nextDiff;
    nextDiff = pointer.hasNext() ? pointer.next() : null;
  }
  // If shifts were made, the diff needs reordering and another shift sweep.
  if (changes) {
    diff_cleanupMerge(diffs);
  }
}

def diff_cleanupMerge(diffs):
  # Reorder and merge like edit sections.  Merge equalities.
  # Any edit section can move as long as it does not cross an equality.
  diffs.append((DIFF_EQUAL, ''))  # Add a dummy entry at the end.
  pointer = 0
  count_delete = 0
  count_insert = 0
  text_delete = ''
  text_insert = ''
  while pointer < len(diffs):
    if diffs[pointer][0] == DIFF_INSERT:
      count_insert += 1
      text_insert += diffs[pointer][1]
      pointer += 1
    elif diffs[pointer][0] == DIFF_DELETE:
      count_delete += 1
      text_delete += diffs[pointer][1]
      pointer += 1
    elif diffs[pointer][0] == DIFF_EQUAL:
      # Upon reaching an equality, check for prior redundancies.
      if count_delete != 0 or count_insert != 0:
        if count_delete != 0 and count_insert != 0:
          # Factor out any common prefixies.
          commonlength = diff_commonPrefix(text_insert, text_delete)
          if commonlength != 0:
            x = pointer - count_delete - count_insert - 1
            if x >= 0 and diffs[x][0] == DIFF_EQUAL:
              diffs[x] = (diffs[x][0], diffs[x][1] +
                          text_insert[:commonlength])
            else:
              diffs.insert(0, (DIFF_EQUAL, text_insert[:commonlength]))
              pointer += 1
            text_insert = text_insert[commonlength:]
            text_delete = text_delete[commonlength:]
          # Factor out any common suffixies.
          commonlength = diff_commonSuffix(text_insert, text_delete)
          if commonlength != 0:
            diffs[pointer] = (diffs[pointer][0], text_insert[-commonlength:] +
                diffs[pointer][1])
            text_insert = text_insert[:-commonlength]
            text_delete = text_delete[:-commonlength]
        # Delete the offending records and add the merged ones.
        if count_delete == 0:
          diffs[pointer - count_insert : pointer] = [
              (DIFF_INSERT, text_insert)]
        elif count_insert == 0:
          diffs[pointer - count_delete : pointer] = [
              (DIFF_DELETE, text_delete)]
        else:
          diffs[pointer - count_delete - count_insert : pointer] = [
              (DIFF_DELETE, text_delete),
              (DIFF_INSERT, text_insert)]
        pointer = pointer - count_delete - count_insert + 1
        if count_delete != 0:
          pointer += 1
        if count_insert != 0:
          pointer += 1
      elif pointer != 0 and diffs[pointer - 1][0] == DIFF_EQUAL:
        # Merge this equality with the previous one.
        diffs[pointer - 1] = (diffs[pointer - 1][0],
                              diffs[pointer - 1][1] + diffs[pointer][1])
        del diffs[pointer]
      else:
        pointer += 1

count_insert = 0
      count_delete = 0
      text_delete = ''
      text_insert = ''

if diffs[-1][1] == '':
    diffs.pop()  # Remove the dummy entry at the end.

# Second pass: look for single edits surrounded on both sides by equalities
  # which can be shifted sideways to eliminate an equality.
  # e.g: A<ins>BA</ins>C -> <ins>AB</ins>AC
  changes = False
  pointer = 1
  # Intentionally ignore the first and last element (don't need checking).
  while pointer < len(diffs) - 1:
    if (diffs[pointer - 1][0] == DIFF_EQUAL and
        diffs[pointer + 1][0] == DIFF_EQUAL):
      # This is a single edit surrounded by equalities.
      if diffs[pointer][1].endswith(diffs[pointer - 1][1]):
        # Shift the edit over the previous equality.
        diffs[pointer] = (diffs[pointer][0],
            diffs[pointer - 1][1] +
            diffs[pointer][1][:-len(diffs[pointer - 1][1])])
        diffs[pointer + 1] = (diffs[pointer + 1][0],
                              diffs[pointer - 1][1] + diffs[pointer + 1][1])
        del diffs[pointer - 1]
        changes = True
      elif diffs[pointer][1].startswith(diffs[pointer + 1][1]):
        # Shift the edit over the next equality.
        diffs[pointer - 1] = (diffs[pointer - 1][0],
                              diffs[pointer - 1][1] + diffs[pointer + 1][1])
        diffs[pointer] = (diffs[pointer][0],
            diffs[pointer][1][len(diffs[pointer + 1][1]):] +
            diffs[pointer + 1][1])
        del diffs[pointer + 1]
        changes = True
    pointer += 1

# If shifts were made, the diff needs reordering and another shift sweep.
  if changes:
    diff_cleanupMerge(diffs)

Transposition helps a little bit and is completely safe, but the larger problem is that differences between two dissimilar texts are frequently littered with small coincidental equalities called 'chaff'. The expected result above might be to delete all of 'Text 1' and insert all of 'Text 2', with the possible exception of the period at the end. However most algorithms will salvage bits and pieces, resulting in a mess.

This problem is most apparent in character-based differences since the small set of alphanumeric characters ensures commonalities. A word-based difference of the above example would be distinctly better, but would have inappropriately salvaged " the ". Longer texts would result in more shared words. A line-based difference of the above example would be ideal. However, even line-based differences are vulnerable to inappropriately salvaging blank lines and other common lines (such as "} else {" in source code).

The problem of chaff is actually one of two different problems: efficiency or semantics. Each of these problems requires a different solution.

3.1 Efficiency

If the output of the difference is designed for computer use (such as delta compression or input to a patch program) then depending on the subsequent application or storage method, each edit operation may have some fixed computational overhead associated with it in addition to the number of characters within that edit. For instance, fifty single-character edits might take more storage or take longer for the next application to process than a single fifty-character edit. Once the trade-off has been measured, the computational or storage cost of an edit operation may be stated in terms of the equivalent cost of characters of change. If this cost is zero, then there is no overhead. If this cost is (for example) ten characters, then increasing the total number of characters edited by up to nine, while reducing the number of edit operations by one, would result in a net savings. Thus the total cost of a difference can be computed as o * c + n where o is the number of edit operations, c is the constant cost of each edit operation in terms of characters, and n is the total number of characters changed. Below are three examples (with c set arbitrarily at 4) showing how increasing the number of edited characters can reduce the number of edit operations and reduce the overall cost of the difference.

First, any equality (text which remains unchanged) which is surrounded on both sides by an existing insertion and deletion need be less than c characters long for it to be advantageous to split it.

	`Operations`	`Characters`	`Cost`
`Text 1: ABXYZCD Text 2: 12XYZ34 Diff: AB12XYZCD34 Split: AB12XYZXYZCD34 Merge: ABXYZCD12XYZ34`	`4 * 4 6 * 4 2 * 4`	`+ 8 + 14 + 14`	`= 24 = 38 = 22`

Secondly, any equality which is surrounded on one side by an existing insertion and deletion, and on the other side by an existing insertion or deletion, need be less than half c characters long for it to be advantageous to split it.

	`Operations`	`Characters`	`Cost`
`Text 1: XCD Text 2: 12X34 Diff: 12XCD34 Split: 12XXCD34 Merge: XCD12X34`	`3 * 4 5 * 4 2 * 4`	`+ 6 + 8 + 8`	`= 18 = 28 = 16`

Both of these conditions may be computed quickly by making a single pass through the data, backtracking to reevaluate the previous equality if a split has changed the type of edits surrounding it. Another pass is made to reorder the edit operations and merge like ones together.

Sample Code

if (changes) {
    diff_cleanupMerge(diffs);
  }
}

See earlier code for 'diff_cleanupMerge' function.

public void diff_cleanupEfficiency(LinkedList<Diff> diffs) {
  // Reduce the number of edits by eliminating operationally trivial
  // equalities.
  if (diffs.isEmpty()) {
    return;
  }
  boolean changes = false;
  Stack<Diff> equalities = new Stack<Diff>();  // Stack of equalities.
  String lastequality = null;  // Always equal to equalities.lastElement().text
  ListIterator<Diff> pointer = diffs.listIterator();
  // Is there an insertion operation before the last equality.
  boolean pre_ins = false;
  // Is there a deletion operation before the last equality.
  boolean pre_del = false;
  // Is there an insertion operation after the last equality.
  boolean post_ins = false;
  // Is there a deletion operation after the last equality.
  boolean post_del = false;
  Diff thisDiff = pointer.next();
  Diff safeDiff = thisDiff;  // The last Diff that is known to be unsplitable.
  while (thisDiff != null) {
    if (thisDiff.operation == Operation.EQUAL) {
      // equality found
      if (thisDiff.text.length() < Diff_EditCost && (post_ins || post_del)) {
        // Candidate found.
        equalities.push(thisDiff);
        pre_ins = post_ins;
        pre_del = post_del;
        lastequality = thisDiff.text;
      } else {
        // Not a candidate, and can never become one.
        equalities.clear();
        lastequality = null;
        safeDiff = thisDiff;
      }
      post_ins = post_del = false;
    } else {
      // an insertion or deletion
      if (thisDiff.operation == Operation.DELETE) {
        post_del = true;
      } else {
        post_ins = true;
      }
      /*
       * Five types to be split:
       * <ins>A</ins><del>B</del>XY<ins>C</ins><del>D</del>
       * <ins>A</ins>X<ins>C</ins><del>D</del>
       * <ins>A</ins><del>B</del>X<ins>C</ins>
       * <ins>A</del>X<ins>C</ins><del>D</del>
       * <ins>A</ins><del>B</del>X<del>C</del>
       */
      if (lastequality != null
          && ((pre_ins && pre_del && post_ins && post_del)
              || ((lastequality.length() < Diff_EditCost / 2)
                  && ((pre_ins ? 1 : 0) + (pre_del ? 1 : 0)
                      + (post_ins ? 1 : 0) + (post_del ? 1 : 0)) == 3))) {
        // Walk back to offending equality.
        while (thisDiff != equalities.lastElement()) {
          thisDiff = pointer.previous();
        }
        pointer.next();

// Replace equality with a delete.
        pointer.set(new Diff(Operation.DELETE, lastequality));
        // Insert a coresponding an insert.
        pointer.add(thisDiff = new Diff(Operation.INSERT, lastequality));

equalities.pop();  // Throw away the equality we just deleted.
        lastequality = null;
        if (pre_ins && pre_del) {
          // No changes made which could affect previous entry, keep going.
          post_ins = post_del = true;
          equalities.clear();
          safeDiff = thisDiff;
        } else {
          if (!equalities.empty()) {
            // Throw away the previous equality (it needs to be reevaluated).
            equalities.pop();
          }
          if (equalities.empty()) {
            // There are no previous questionable equalities,
            // walk back to the last known safe diff.
            thisDiff = safeDiff;
          } else {
            // There is an equality we can fall back to.
            thisDiff = equalities.lastElement();
          }
          while (thisDiff != pointer.previous()) {
            // Intentionally empty loop.
          }
          post_ins = post_del = false;
        }

changes = true;
      }
    }
    thisDiff = pointer.hasNext() ? pointer.next() : null;
  }

if (changes) {
    diff_cleanupMerge(diffs);
  }
}

See earlier code for 'diff_cleanupMerge' function.

def diff_cleanupEfficiency(diffs):
  # Reduce the number of edits by eliminating operationally trivial equalities.
  changes = False
  equalities = []  # Stack of indices where equalities are found.
  lastequality = ''  # Always equal to equalities[len(equalities)-1][1]
  pointer = 0  # Index of current position.
  pre_ins = False  # Is there an insertion operation before the last equality.
  pre_del = False  # Is there a deletion operation before the last equality.
  post_ins = False  # Is there an insertion operation after the last equality.
  post_del = False  # Is there a deletion operation after the last equality.
  while pointer < len(diffs):
    if diffs[pointer][0] == DIFF_EQUAL:  # equality found
      if (len(diffs[pointer][1]) < Diff_EditCost and
          (post_ins or post_del)):
        # Candidate found.
        equalities.append(pointer)
        pre_ins = post_ins
        pre_del = post_del
        lastequality = diffs[pointer][1]
      else:
        # Not a candidate, and can never become one.
        equalities = []
        lastequality = ''

post_ins = post_del = False
    else:  # an insertion or deletion
      if diffs[pointer][0] == DIFF_DELETE:
        post_del = True
      else:
        post_ins = True
        
      # Five types to be split:
      # <ins>A</ins><del>B</del>XY<ins>C</ins><del>D</del>
      # <ins>A</ins>X<ins>C</ins><del>D</del>
      # <ins>A</ins><del>B</del>X<ins>C</ins>
      # <ins>A</del>X<ins>C</ins><del>D</del>
      # <ins>A</ins><del>B</del>X<del>C</del>
      
      if lastequality and ((pre_ins and pre_del and post_ins and post_del) or
                           ((len(lastequality) < Diff_EditCost / 2) and
                            (pre_ins + pre_del + post_ins + post_del) == 3)):
        # Duplicate record
        diffs.insert(equalities[len(equalities) - 1],
                     (DIFF_DELETE, lastequality))
        # Change second copy to insert.
        diffs[equalities[len(equalities) - 1] + 1] = (DIFF_INSERT,
            diffs[equalities[len(equalities) - 1] + 1][1])
        equalities.pop()  # Throw away the equality we just deleted
        lastequality = ''
        if pre_ins and pre_del:
          # No changes made which could affect previous entry, keep going.
          post_ins = post_del = True
          equalities = []
        else:
          if len(equalities):
            equalities.pop()  # Throw away the previous equality
          if len(equalities):
            pointer = equalities[len(equalities) - 1]
          else:
            pointer = -1
          post_ins = post_del = False
        changes = True
    pointer += 1

if changes:
    diff_cleanupMerge(diffs)

See earlier code for 'diff_cleanupMerge' function.

Although this is a good start, it is not a complete solution since it does not catch a third type of condition:

	`Operations`	`Characters`	`Cost`
`Text 1: ABCD Text 2: A1B2C3D4 Diff: A1B2C3D4 Split: A1BB2CC3DD4 Merge: ABCD1B2C3D4`	`4 * 4 10 * 4 2 * 4`	`+ 4 + 10 + 10`	`= 20 = 50 = 18`

In this and similar cases, each individual split would result in a higher total cost, yet these splits, when combined, result in a lower total cost. Computing this form of optimisation appears to be an O(n²) operation on selected regions of the difference (as opposed to the O(n) optimisation for the first two cases), thus it may be more costly than the savings themselves.

3.2 Semantics

3.2.1 Semantic Chaff

If the output of the difference is designed for human use (such as a visual display), the problem changes. In this case the goal is to provide more meaningful divisions. Consider these two examples:


Text 1: Quicq fyre      | Text 1: Slow fool

Text 2: Quick fire      | Text 2: Quick fire

Diff:   Quicqk fyire    | Diff:   SlowQuick foolire

Split:  Quicqk f fyire  | Split:  SlowQuick f foolire

Merge:  Quicq fyk fire  | Merge:  Slow foolQuick fire

Mathematically, these examples are very similar. They have the same central equality (" f") and they have the same number of edit operations. Yet the first example (which involves correcting two typographical errors) is more meaningful in its raw diff stage, rather than after splitting and merging the equality. Whereas the second example (which involves larger edits) has little meaning at its raw diff stage, and is much clearer after splitting and merging the equality. The primary distinction between these two examples is the amount of change surrounding the equality.

One solution for removing semantic chaff is to pass over the data looking for equalities that are smaller than or equal to the insertions and deletions on both sides of them. When such an equality is found, it is split into a deletion and an addition. Then a second pass is made to reorder and merge all deletions and additions which aren't separated by surviving equalities. Below is a somewhat contrived example showing the these steps:

Text 1:  Hovering

Text 2:  My government

Diff:    HMy goveringment

Split 1: HMy goverinngment

Split 2: HMy goveroverinngment

Merge:   HoveringMy government

In this case "over" is four letters long, compared with only five and one letters of changes surrounding it, so it is left. However, "n" is only one letter, compared with one and five letters of changes surrounding it. Therefore "n" is split. Once an equality is split, the pass must backtrack to reevaluate the previous equality since its context has changed. In this case "over" is now surrounded by five and eight letters of changes, so it too is split. Finally all the pieces are collected together, resulting in an easily understandable difference.

Sample Code

public void diff_cleanupSemantic(LinkedList<Diff> diffs) {
  // Reduce the number of edits by eliminating semantically trivial equalities.
  if (diffs.isEmpty()) {
    return;
  }
  boolean changes = false;
  Stack<Diff> equalities = new Stack<Diff>();  // Stack of qualities.
  String lastequality = null;  // Always equal to equalities.lastElement().text
  ListIterator<Diff> pointer = diffs.listIterator();
  // Number of characters that changed prior to the equality.
  int length_changes1 = 0;
  // Number of characters that changed after the equality.
  int length_changes2 = 0;
  Diff thisDiff = pointer.next();
  while (thisDiff != null) {
    if (thisDiff.operation == Operation.EQUAL) {
      // equality found
      equalities.push(thisDiff);
      length_changes1 = length_changes2;
      length_changes2 = 0;
      lastequality = thisDiff.text;
    } else {
      // an insertion or deletion
      length_changes2 += thisDiff.text.length();
      if (lastequality != null && (lastequality.length() <= length_changes1)
          && (lastequality.length() <= length_changes2)) {
        //System.out.println("Splitting: '" + lastequality + "'");
        // Walk back to offending equality.
        while (thisDiff != equalities.lastElement()) {
          thisDiff = pointer.previous();
        }
        pointer.next();

// Replace equality with a delete.
        pointer.set(new Diff(Operation.DELETE, lastequality));
        // Insert a corresponding an insert.
        pointer.add(new Diff(Operation.INSERT, lastequality));

equalities.pop();  // Throw away the equality we just deleted.
        if (!equalities.empty()) {
          // Throw away the previous equality (it needs to be reevaluated).
          equalities.pop();
        }
        if (equalities.empty()) {
          // There are no previous equalities, walk back to the start.
          while (pointer.hasPrevious()) {
            pointer.previous();
          }
        } else {
          // There is a safe equality we can fall back to.
          thisDiff = equalities.lastElement();
          while (thisDiff != pointer.previous()) {
            // Intentionally empty loop.
          }
        }

length_changes1 = 0;  // Reset the counters.
        length_changes2 = 0;
        lastequality = null;
        changes = true;
      }
    }
    thisDiff = pointer.hasNext() ? pointer.next() : null;
  }

if (changes) {
    diff_cleanupMerge(diffs);
  }
}

See earlier code for 'diff_cleanupMerge' function.

This solution is not perfect. It has tunnel vision; it is unable to see beyond the immediate neighbourhood of each equality it evaluates. This can result in small groups of chaff surviving:

Text 1: It was a dark and stormy night.

Text 2: The black can in the cupboard.

Diff:   ItThe wblas a darck cand sin the cupboarmy nightd.

Split:  ItThe  wblaas a darck cand  sin tthe cupbooarrmy nightd.

Merge:  It was a darThe black cand stormy night in the cupboard.

A more comprehensive solution might compute a weighted average of differences further away from the equality in question.

3.2.2 Semantic Allignment

A separate issue with creating semantically meaningful diffs is aligning edit boundaries to logical divisions. Consider the following diffs:

Text 1: That cartoon.

Text 2: That cat cartoon.

Diff 1: That cat cartoon.

Diff 2: That cat cartoon.

Diff 3: That cat cartoon.

Diff 4: That cat cartoon.

Diff 5: That cat cartoon.

Diff 6: That cat cartoon.

All six diffs are valid and minimal. Diffs 1 and 6 are the ones most likely to be returned by diff algorithms. But diffs 3 and 4 are more likely to capture the semantic meaning of the diff.

The solution is to locate each insertion or deletion which is surrounded on both sides by equalities, and attempt to slide them sideways. If the last token of the preceeding equality equals the last token of the edit, then the edit may be slid left. Likewise if the first token of the edit equals the first token of the following equality, then the edit may be slit right. Each of the possible locations can be scored based on whether the boundaries appear to be logical. One scheme which works is:

One point if a boundary is adjacent to a non-alphanumeric character.
Two points if a boundary is adjacent to whitespace.
Three points if a boundary is adjacent to a line break.
Four points if a boundary is adjacent to a blank line.
Five points if a boundary has consumed the entire equality.

This scheme would give scores of zero to diffs 1, 2, 5 and 6, while giving scores of four to diffs 3 and 4.

Sample Code

function diff_cleanupSemanticLossless(diffs) {
  // Look for single edits surrounded on both sides by equalities
  // which can be shifted sideways to align the edit to a word boundary.
  // e.g: The c<ins>at c</ins>ame. -> The <ins>cat </ins>came.
  // Define some regex patterns for matching boundaries. 
  var punctuation = /[^a-zA-Z0-9]/;
  var whitespace = /\s/;
  var linebreak = /[\r\n]/;
  var blanklineEnd = /\n\r?\n$/;
  var blanklineStart = /^\r?\n\r?\n/;

function diff_cleanupSemanticScore(one, two) {
    // Given two strings, compute a score representing whether the internal
    // boundary falls on logical boundaries.
    // Scores range from 5 (best) to 0 (worst).
    // Closure, makes reference to regex patterns defined above.
    if (!one || !two) {
      // Edges are the best.
      return 5;
    }

// Each port of this function behaves slightly differently due to
    // subtle differences in each language's definition of things like
    // 'whitespace'.  Since this function's purpose is largely cosmetic,
    // the choice has been made to use each language's native features
    // rather than force total conformity.
    var score = 0;
    // One point for non-alphanumeric.
    if (one.charAt(one.length - 1).match(punctuation) ||
        two.charAt(0).match(punctuation)) {
      score++;
      // Two points for whitespace.
      if (one.charAt(one.length - 1).match(whitespace) ||
          two.charAt(0).match(whitespace)) {
        score++;
        // Three points for line breaks.
        if (one.charAt(one.length - 1).match(linebreak) ||
            two.charAt(0).match(linebreak)) {
          score++;
          // Four points for blank lines.
          if (one.match(blanklineEnd) || two.match(blanklineStart)) {
            score++;
          }
        }
      }
    }
    return score;
  }

var pointer = 1;
  // Intentionally ignore the first and last element (don't need checking).
  while (pointer < diffs.length - 1) {
    if (diffs[pointer - 1][0] == DIFF_EQUAL &&
        diffs[pointer + 1][0] == DIFF_EQUAL) {
      // This is a single edit surrounded by equalities.
      var equality1 = diffs[pointer - 1][1];
      var edit = diffs[pointer][1];
      var equality2 = diffs[pointer + 1][1];

// First, shift the edit as far left as possible.
      var commonOffset = diff_commonSuffix(equality1, edit);
      if (commonOffset) {
        var commonString = edit.substring(edit.length - commonOffset);
        equality1 = equality1.substring(0, equality1.length - commonOffset);
        edit = commonString + edit.substring(0, edit.length - commonOffset);
        equality2 = commonString + equality2;
      }

// Second, step character by character right, looking for the best fit.
      var bestEquality1 = equality1;
      var bestEdit = edit;
      var bestEquality2 = equality2;
      var bestScore = diff_cleanupSemanticScore(equality1, edit) +
          diff_cleanupSemanticScore(edit, equality2);
      while (edit.charAt(0) === equality2.charAt(0)) {
        equality1 += edit.charAt(0);
        edit = edit.substring(1) + equality2.charAt(0);
        equality2 = equality2.substring(1);
        var score = diff_cleanupSemanticScore(equality1, edit) +
            diff_cleanupSemanticScore(edit, equality2);
        // The >= encourages trailing rather than leading whitespace on edits.
        if (score >= bestScore) {
          bestScore = score;
          bestEquality1 = equality1;
          bestEdit = edit;
          bestEquality2 = equality2;
        }
      }

if (diffs[pointer - 1][1] != bestEquality1) {
        // We have an improvement, save it back to the diff.
        if (bestEquality1) {
          diffs[pointer - 1][1] = bestEquality1;
        } else {
          diffs.splice(pointer - 1, 1);
          pointer--;
        }
        diffs[pointer][1] = bestEdit;
        if (bestEquality2) {
          diffs[pointer + 1][1] = bestEquality2;
        } else {
          diffs.splice(pointer + 1, 1);
          pointer--;
        }
      }
    }
    pointer++;
  }
}

public void diff_cleanupSemanticLossless(LinkedList<Diff> diffs) {
  // Look for single edits surrounded on both sides by equalities
  // which can be shifted sideways to align the edit to a word boundary.
  // e.g: The c<ins>at c</ins>ame. -> The <ins>cat </ins>came.
  String equality1, edit, equality2;
  String commonString;
  int commonOffset;
  int score, bestScore;
  String bestEquality1, bestEdit, bestEquality2;
  // Create a new iterator at the start.
  ListIterator<Diff> pointer = diffs.listIterator();
  Diff prevDiff = pointer.hasNext() ? pointer.next() : null;
  Diff thisDiff = pointer.hasNext() ? pointer.next() : null;
  Diff nextDiff = pointer.hasNext() ? pointer.next() : null;
  // Intentionally ignore the first and last element (don't need checking).
  while (nextDiff != null) {
    if (prevDiff.operation == Operation.EQUAL &&
        nextDiff.operation == Operation.EQUAL) {
      // This is a single edit surrounded by equalities.
      equality1 = prevDiff.text;
      edit = thisDiff.text;
      equality2 = nextDiff.text;

// First, shift the edit as far left as possible.
      commonOffset = diff_commonSuffix(equality1, edit);
      if (commonOffset != 0) {
        commonString = edit.substring(edit.length() - commonOffset);
        equality1 = equality1.substring(0, equality1.length() - commonOffset);
        edit = commonString + edit.substring(0, edit.length() - commonOffset);
        equality2 = commonString + equality2;
      }

// Second, step character by character right, looking for the best fit.
      bestEquality1 = equality1;
      bestEdit = edit;
      bestEquality2 = equality2;
      bestScore = diff_cleanupSemanticScore(equality1, edit)
          + diff_cleanupSemanticScore(edit, equality2);
      while (edit.length() != 0 && equality2.length() != 0
          && edit.charAt(0) == equality2.charAt(0)) {
        equality1 += edit.charAt(0);
        edit = edit.substring(1) + equality2.charAt(0);
        equality2 = equality2.substring(1);
        score = diff_cleanupSemanticScore(equality1, edit)
            + diff_cleanupSemanticScore(edit, equality2);
        // The >= encourages trailing rather than leading whitespace on edits.
        if (score >= bestScore) {
          bestScore = score;
          bestEquality1 = equality1;
          bestEdit = edit;
          bestEquality2 = equality2;
        }
      }

if (!prevDiff.text.equals(bestEquality1)) {
        // We have an improvement, save it back to the diff.
        if (bestEquality1.length() != 0) {
          prevDiff.text = bestEquality1;
        } else {
          pointer.previous(); // Walk past nextDiff.
          pointer.previous(); // Walk past thisDiff.
          pointer.previous(); // Walk past prevDiff.
          pointer.remove(); // Delete prevDiff.
          pointer.next(); // Walk past thisDiff.
          pointer.next(); // Walk past nextDiff.
        }
        thisDiff.text = bestEdit;
        if (bestEquality2.length() != 0) {
          nextDiff.text = bestEquality2;
        } else {
          pointer.remove(); // Delete nextDiff.
          nextDiff = thisDiff;
          thisDiff = prevDiff;
        }
      }
    }
    prevDiff = thisDiff;
    thisDiff = nextDiff;
    nextDiff = pointer.hasNext() ? pointer.next() : null;
  }
}

private int diff_cleanupSemanticScore(String one, String two) {
  // Given two strings, compute a score representing whether the internal
  // boundary falls on logical boundaries.
  // Scores range from 5 (best) to 0 (worst).
  if (one.length() == 0 || two.length() == 0) {
    // Edges are the best.
    return 5;
  }

// Each port of this function behaves slightly differently due to
  // subtle differences in each language's definition of things like
  // 'whitespace'.  Since this function's purpose is largely cosmetic,
  // the choice has been made to use each language's native features
  // rather than force total conformity.
  int score = 0;
  // One point for non-alphanumeric.
  if (!Character.isLetterOrDigit(one.charAt(one.length() - 1))
      || !Character.isLetterOrDigit(two.charAt(0))) {
    score++;
    // Two points for whitespace.
    if (Character.isWhitespace(one.charAt(one.length() - 1))
        || Character.isWhitespace(two.charAt(0))) {
      score++;
      // Three points for line breaks.
      if (Character.getType(one.charAt(one.length() - 1)) == Character.CONTROL
          || Character.getType(two.charAt(0)) == Character.CONTROL) {
        score++;
        // Four points for blank lines.
        if (BLANKLINEEND.matcher(one).find()
            || BLANKLINESTART.matcher(two).find()) {
          score++;
        }
      }
    }
  }
  return score;
}

private Pattern BLANKLINEEND
    = Pattern.compile("\\n\\r?\\n\\Z", Pattern.DOTALL);
private Pattern BLANKLINESTART
    = Pattern.compile("\\A\\r?\\n\\r?\\n", Pattern.DOTALL);

def diff_cleanupSemanticLossless(diffs):
  # Look for single edits surrounded on both sides by equalities
  # which can be shifted sideways to align the edit to a word boundary.
  # e.g: The c<ins>at c</ins>ame. -> The <ins>cat </ins>came.

def diff_cleanupSemanticScore(one, two):
    # Given two strings, compute a score representing whether the
    # internal boundary falls on logical boundaries.
    # Scores range from 5 (best) to 0 (worst).
    # Closure, but does not reference any external variables.

if not one or not two:
      # Edges are the best.
      return 5

# Each port of this function behaves slightly differently due to
    # subtle differences in each language's definition of things like
    # 'whitespace'.  Since this function's purpose is largely cosmetic,
    # the choice has been made to use each language's native features
    # rather than force total conformity.
    score = 0
    # One point for non-alphanumeric.
    if not one[-1].isalnum() or not two[0].isalnum():
      score += 1
      # Two points for whitespace.
      if one[-1].isspace() or two[0].isspace():
        score += 1
        # Three points for line breaks.
        if (one[-1] == "\r" or one[-1] == "\n" or
            two[0] == "\r" or two[0] == "\n"):
          score += 1
          # Four points for blank lines.
          if (re.search("\\n\\r?\\n$", one) or
              re.match("^\\r?\\n\\r?\\n", two)):
            score += 1
    return score

pointer = 1
  # Intentionally ignore the first and last element (don't need checking).
  while pointer < len(diffs) - 1:
    if (diffs[pointer - 1][0] == DIFF_EQUAL and
        diffs[pointer + 1][0] == DIFF_EQUAL):
      # This is a single edit surrounded by equalities.
      equality1 = diffs[pointer - 1][1]
      edit = diffs[pointer][1]
      equality2 = diffs[pointer + 1][1]

# First, shift the edit as far left as possible.
      commonOffset = diff_commonSuffix(equality1, edit)
      if commonOffset:
        commonString = edit[-commonOffset:]
        equality1 = equality1[:-commonOffset]
        edit = commonString + edit[:-commonOffset]
        equality2 = commonString + equality2

# Second, step character by character right, looking for the best fit.
      bestEquality1 = equality1
      bestEdit = edit
      bestEquality2 = equality2
      bestScore = (diff_cleanupSemanticScore(equality1, edit) +
          diff_cleanupSemanticScore(edit, equality2))
      while edit and equality2 and edit[0] == equality2[0]:
        equality1 += edit[0]
        edit = edit[1:] + equality2[0]
        equality2 = equality2[1:]
        score = (diff_cleanupSemanticScore(equality1, edit) +
            diff_cleanupSemanticScore(edit, equality2))
        # The >= encourages trailing rather than leading whitespace on edits.
        if score >= bestScore:
          bestScore = score
          bestEquality1 = equality1
          bestEdit = edit
          bestEquality2 = equality2

if diffs[pointer - 1][1] != bestEquality1:
        # We have an improvement, save it back to the diff.
        if bestEquality1:
          diffs[pointer - 1] = (diffs[pointer - 1][0], bestEquality1)
        else:
          del diffs[pointer - 1]
          pointer -= 1
        diffs[pointer] = (diffs[pointer][0], bestEdit)
        if bestEquality2:
          diffs[pointer + 1] = (diffs[pointer + 1][0], bestEquality2)
        else:
          del diffs[pointer + 1]
          pointer -= 1
    pointer += 1

See an implementation and online demonstration of diff.
See also the companion paper on diff's counterpart: Patch

Neil Fraser: Writing: Diff Strategies

Last modified: 15 August 2008