Neil's News

Diff Accelerators

12 March 2006

Note:
The content on this page has been superseded: Diff Strategies

Computing the differences between two pieces of data is a surprisingly difficult and time consuming task. One good algorithm is described in An O(ND) Difference Algorithm and Its Variations. However, regardless of the algorithm used, it will always be a computationally painful process.

Here are four shortcuts which can significantly accelerate the process:

  1. Equality
    In most real world usage of diff there is a non-trivial chance that the two texts are identical.
      text1: The cat in the hat.
      text2: The cat in the hat.
    Therefore it makes sense to perform a trivial test for this:
      if (text1 == text2)
        return null;
    One side-effect of this test is that it may simplify subsequent code. One knows beyond a doubt that after this test, there will be a difference. The null case is eliminated.

  2. Common Prefix/Suffix
    If there is any commonality at all between the texts, it is likely that they will share a common substring at the start and/or the end.
      text1: The cat in the hat.
      text2: The dog in the hat.
    This can be simplified down to:
      text1: cat
      text2: dog
    Locating these common substrings can be done very efficiently using a binary search:
      // Trim off common prefix
      var pointermin = 0;
      var pointermax = Math.min(text1.length, text2.length);
      var pointermid = pointermax;
      while(pointermin < pointermid) {
        if (text1.substring(0, pointermid) == text2.substring(0, pointermid))
          pointermin = pointermid;
        else
          pointermax = pointermid;
        pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
      }
      var commonprefix = text1.substring(0, pointermid);
      text1 = text1.substring(pointermid);
      text2 = text2.substring(pointermid);
    
      // Trim off common suffix
      pointermin = 0;
      pointermax = Math.min(text1.length, text2.length);
      pointermid = pointermax;
      while(pointermin < pointermid) {
        if (text1.substring(text1.length-pointermid) == text2.substring(text2.length-pointermid))
          pointermin = pointermid;
        else
          pointermax = pointermid;
        pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
      }
      var commonsuffix = text1.substring(text1.length-pointermid);
      text1 = text1.substring(0, text1.length-pointermid);
      text2 = text2.substring(0, text2.length-pointermid);
  3. Singular Insertion/Deletion
    A common difference is the insertion or the deletion of some text:
      text1: The cat in the hat.             |  text1: The cat in the hat.
      text2: The furry cat in the hat.       |  text2: The cat.
    After removing the common prefixes and suffixes one gets:
      text1:                                 |  text1:  in the hat
      text2: furry                           |  text2: 
    The presence of an empty text1 in the first example indicates that text2 is an addition. The presence of an empty text2 in the second example indicates that text1 is a deletion. A couple of 'if' statements can detect these cases and avoid any need for running a diff algorithm.

  4. Two Edits
    Sometimes one might have two edits which are separated by considerable text:
      text1: The cat in the hat.
      text2: The hat in the cat.
    After removing the common prefixes and suffixes one gets:
      text1: cat in the h
      text2: hat in the c
    If a substring exists in both texts which is at least half the length of the longer text, then it is guaranteed to be common. In this case the texts can be split in two, and separate diffs carried out:
      text1: c     |  text1: h
      text2: h     |  text2: c
    At this point one would want to pass both of these diffs back through accelerators 3 & 4 (though in this example they wouldn't help). Determining whether such a common substring exists is not trivial, though if it does exist it represents a great savings in the subsequent diff computations.

Surprisingly none of these accelerators are mentioned in any of the papers I've read regarding diff algorithms. I wonder if there are any more significant accelerators. 'Significant' being the key word, since one could go on for hours trapping subtler and subtler changes. It would be nice to have some sort of standard against which to measure the relative values of these accelerators. Unfortunately that is not possible since each application of diff is used very differently.

Update: A working implementation in JavaScript is now available.

< Previous | Next >

 
-------------------------------------
Legal yada yada: My views do not necessarily represent those of my employer or my goldfish.