4 November 2010
How does one detect when two strings overlap? In the case below, the four-letter suffix of string 1 matches the four-letter prefix of string 2.
1: "Fire at Will" 2: "William Riker is number one"
Sometimes there are several matches; finding the longest is not always straight forward.
1: "Have some CoCo and CoCo" 2: "CoCo and CoCo is here." 2: "CoCo and CoCo is here." 2: "CoCo and CoCo is here."
The naïve solution is to take ever smaller substrings of each string, compare them, then bail out when the first match is found. This is quick to program, very reliable, and very inefficient for long strings.
A more efficient solution is to use a modified version of the Knuth-Morris-Pratt algorithm to scan the strings while keeping track of partial matches. This is very tricky to program, with many opportunities for off-by-one errors.
A somewhat simpler approach is to leverage the highly efficient indexOf function that is built into most languages (it is called find in Python). Start by assuming a single letter overlap and search for that letter in the second string. If found, then check the two substrings for equality. Then use indexOf to locate any instance of the substring plus one character. Keep repeating until no matches are found, then return the last confirmed substring match.
Now that we have three functions that all do the same thing, which one is faster? Well, that depends. Let's feed random strings to each function, like this:
However, the IndexOf algorithm relies on there not being too many coincidental matches. Let's see what happens when we feed it a pathological input string, like this:
Now we have a problem. The best algorithm can become the worst algorithm, depending on the inputs. KMP offers a consistently scalable solution regardless of input, but potentially at a cost of 100x performance. Let's take a closer look at the inputs. Can we recreate the pathological behaviour with real-world data? Let's start by creating fractal strings with lots of repetitions, like this:
A major user of string manipulation utilities these days is the genetics world. DNA has an alphabet of four letters, A, C, G and T. This will certainly result in large numbers of coincidental matches. Let's download the human genome, like this:
Computer Science (always suspect any discipline which feels the need to add the word 'science' to its name) is often an exercise in compromises. Does one choose the algorithm that works best most of the time (IndexOf)? Does one choose the algorithm that will never fail badly (KMP)? Does one choose the algorithm that's easiest to program and least likely to contain bugs (Naïve)? Does one spawn off two threads each running different algorithms to execute on separate processors with the winner terminating the loser?
Special thanks to Tancred Lindholm for continually asking awkward questions every time I thought I was finished.