9 October 2007
The common prefix and suffix algorithms in my Diff, Match and Patch library use a binary search of substrings to locate the divergent point. I claim this to be O(log n) instead of O(n). In the year since this was first published I've been taking a lot of heat from programmers who look at the algorithm and realize what the substring operation is doing at the processor level. They write back, call me up, or burst out laughing when they realize that my 'optimization' actually evaluates to O(n log n). I patiently explain that high-level languages don't work that way, but the skeptics remain unconvinced.
Fine, let's look at some data. Here's a bake-off between the standard linear O(n) approach and my unexpectedly controversial so-called O(log n) binary search:
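For readers who want to reproduce the bake-off, the two contenders can be sketched as follows. This is a simplified reconstruction for illustration, not the library's exact code; the function names are mine:

```javascript
// Linear scan: compare one character at a time until the first mismatch.
// O(n) character comparisons.
function commonPrefixLinear(text1, text2) {
  const n = Math.min(text1.length, text2.length);
  for (let i = 0; i < n; i++) {
    if (text1.charAt(i) !== text2.charAt(i)) {
      return i;
    }
  }
  return n;
}

// Binary search over substring equality: O(log n) substring comparisons,
// each of which is nominally O(n) at the processor level -- hence the
// O(n log n) objection discussed above.
function commonPrefixBinary(text1, text2) {
  // Quick rejection: empty input or differing first characters.
  if (!text1 || !text2 || text1.charAt(0) !== text2.charAt(0)) {
    return 0;
  }
  let pointermin = 0;
  let pointermax = Math.min(text1.length, text2.length);
  let pointermid = pointermax;
  let pointerstart = 0;
  while (pointermin < pointermid) {
    if (text1.substring(pointerstart, pointermid) ===
        text2.substring(pointerstart, pointermid)) {
      // The candidate prefix matches; search the upper half.
      pointermin = pointermid;
      pointerstart = pointermin;
    } else {
      // Mismatch inside the candidate; search the lower half.
      pointermax = pointermid;
    }
    pointermid = Math.floor((pointermax - pointermin) / 2) + pointermin;
  }
  return pointermid;
}
```

Both return the length of the common prefix, e.g. 3 for "abcdef" versus "abcxyz". The point of contention is that the substring comparison inside the loop is not a character-by-character operation in most high-level runtimes: it compiles down to a highly optimized memory comparison.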
And I've helpfully graphed the output on a logarithmic scale:
Update: Wondering how Python compares? bin_vs_lin.py reports that the binary approach is significantly faster than the linear approach when graphed against each other. The binary approach in Python is about the same speed as either approach in Java.
Update: Wondering how C# compares? bin_vs_lin.cs reports that the linear approach is significantly faster than the binary approach when graphed against each other. Unlike any other language tested so far, C# dies with an out-of-memory exception when generating a string of 1,000,000 characters. Both Microsoft's C# compiler and Mono behaved the same.
Update: This algorithmic analysis has generated a lot of interest. A Google engineer suggested the fastest algorithm yet:
"A cryptographer would tell you to call text1 == text2 and measure the time it takes to return an answer."

While an engineer from another search engine company suggested the slowest algorithm yet:
var text = text1 + '\x00' + text2;