Neil Fraser: News: Overlap Detection

Overlap Detection

4 November 2010

How does one detect when two strings overlap? In the case below, the four-letter suffix of string 1 matches the four-letter prefix of string 2.

1: "Fire at Will"
2:         "William Riker is number one"

Sometimes there are several matches; finding the longest is not always straight forward.

1: "Have some CoCo and CoCo"
2:                      "CoCo and CoCo is here."
2:                    "CoCo and CoCo is here."
2:           "CoCo and CoCo is here."

The naïve solution is to take ever smaller substrings of each string, compare them, then bail out when the first match is found. This is quick to program, very reliable, and very inefficient for long strings.

A more efficient solution is to use a modified version of the Knuth-Morris-Pratt algorithm to scan the strings while keeping track of partial matches. This is very tricky to program, with many opportunities for off-by-one errors.

def commonOverlapKmp(text1, text2):
  # Cache the text lengths to prevent multiple calls.
  text1_length = len(text1)
  text2_length = len(text2)
  # Eliminate the null case.
  if text1_length == 0 or text2_length == 0:
    return 0
  # Truncate the longer string.
  if text1_length > text2_length:
    text1 = text1[-text2_length:]
  elif text1_length < text2_length:
    text2 = text2[:text1_length]
  text_length = min(text1_length, text2_length)
  # Quick check for the worst case.
  if text1 == text2:
    return text_length
 
  # Build partial match table from text2.
  table = [0] * text_length
  table[0] = -1
  #table[1] = 0
  pos = 2
  cnd = 0
  while pos < text_length:
    if text2[pos - 1] == text2[cnd]:
      cnd += 1
      table[pos] = cnd
      pos += 1
    elif cnd > 0:
      cnd = table[cnd]
    else:
      table[pos] = 0
      pos += 1
 
  # Search text1.
  m = 0
  i = 0
  while m + i < text_length:
    if text2[i] == text1[m + i]:
      i += 1
      if m + i == text_length:
        return i
    else:
      m += i - table[i]
      if table[i] > -1:
        i = table[i]
      else:
        i = 0
  return 0

A somewhat simpler approach is to leverage the highly efficient indexOf function that is built into most languages (it is called find in Python). Start by assuming a single letter overlap and search for that letter in the second string. If found, then check the two substrings for equality. Then use indexOf to locate any instance of the substring plus one character. Keep repeating until no matches are found, then return the last confirmed substring match.

Now that we have three functions that all do the same thing, which one is faster? Well, that depends. Let's feed random strings to each function, like this:
cAx&"[|J{aL[xJu081e:(grxnV`kOOe#&`y#AxfA/;o2~WVE1qMUVqk~ ^]>...
The resulting logarithmic timing plots are fairly clear. The naïve algorithm scales worse than O(n), though it beats KMP until string lengths reach 10,000. KMP is solidly O(n) as advertised. The IndexOf algorithm is also O(n) -- but one hundred times faster than KMP.

[Graph of timings on overlap detection in random strings]

However, the IndexOf algorithm relies on there not being too many coincidental matches. Let's see what happens when we feed it a pathological input string, like this:
baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
The resulting logarithmic timing plots show a sudden change in behaviour. The naïve and KMP algorithms are unchanged, while IndexOf becomes the worst of the bunch.

[Graph of timings on overlap detection in pathological strings]

Now we have a problem. The best algorithm can become the worst algorithm, depending on the inputs. KMP offers a consistently scalable solution regardless of input, but potentially at a cost of 100x performance. Let's take a closer look at the inputs. Can we recreate the pathological behaviour with real-world data? Let's start by creating fractal strings with lots of repetitions, like this:
N>NeN>N`N>NeN>NVN>NeN>N`N>NeN>NRN>NeN>N`N>NeN>NVN>NeN>N`N>Ne...
This is similar to patterns found in source code. The resulting logarithmic timing plots shows that performance returns to near-optimum levels.

$[Graph of timings on overlap detection in fractal strings]$

A major user of string manipulation utilities these days is the genetics world. DNA has an alphabet of four letters, A, C, G and T. This will certainly result in large numbers of coincidental matches. Let's download the human genome, like this:
TCGGTCTCCTTTGGGTAATTTTCCATTATGTCATAACAGTAAATATTAATATGTGCTCCT...
The resulting logarithmic timing plots once again show this as near-optimum. I fed it a fifth of a human and got an answer back in fifteen seconds.

[Graph of timings on overlap detection in DNA]

Computer Science (always suspect any discipline which feels the need to add the word 'science' to its name) is often an exercise in compromises. Does one choose the algorithm that works best most of the time (IndexOf)? Does one choose the algorithm that will never fail badly (KMP)? Does one choose the algorithm that's easiest to program and least likely to contain bugs (Naïve)? Does one spawn off two threads each running different algorithms to execute on separate processors with the winner terminating the loser?

Special thanks to Tancred Lindholm for continually asking awkward questions every time I thought I was finished.

< Previous | Next >

Neil's News

Overlap Detection