Neil's News

Prefix Matching

9 October 2007

The common prefix and suffix algorithms in my Diff, Match and Patch library use a binary search of substrings to locate the point of divergence. I claim this to be O(log n) instead of O(n). In the year since this was first published I've been taking a lot of heat from programmers who look at the algorithm and realize what the substring operation is doing at the processor level. They write back, call me up, or burst out laughing when they realize that my 'optimization' actually evaluates to O(n log n). I patiently explain that high-level languages don't work that way, but the skeptics remain unconvinced.
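For readers who haven't seen the code, here is a sketch of the binary-search version in JavaScript. It illustrates the idea rather than reproducing the library's exact source: each iteration performs a single substring comparison, which is the operation the skeptics count as O(n), even though it executes as one native string operation rather than an interpreted loop.

// Sketch of the binary-search common-prefix routine; names are illustrative.
function commonPrefixBinary(text1, text2) {
  // Quick rejection for empty strings or mismatched first characters.
  if (!text1 || !text2 || text1.charAt(0) != text2.charAt(0)) {
    return 0;
  }
  var pointermin = 0;
  var pointermax = Math.min(text1.length, text2.length);
  var pointermid = pointermax;
  var pointerstart = 0;
  while (pointermin < pointermid) {
    // One substring comparison per iteration of the binary search.
    if (text1.substring(pointerstart, pointermid) ==
        text2.substring(pointerstart, pointermid)) {
      pointermin = pointermid;
      pointerstart = pointermin;
    } else {
      pointermax = pointermid;
    }
    pointermid = Math.floor((pointermax - pointermin) / 2 + pointermin);
  }
  return pointermid;
}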

Fine, let's look at some data. Here's a bake-off between the standard linear O(n) approach and my unexpectedly controversial so-called O(log n) binary search:

[Interactive benchmark: choose a string size and compare the reported timings for the linear search vs. the binary substring search.]
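For reference, the linear O(n) approach being benchmarked is just a character-by-character walk. Again, this is a sketch rather than the exact test harness:

// Sketch of the straightforward linear common-prefix routine.
function commonPrefixLinear(text1, text2) {
  var n = Math.min(text1.length, text2.length);
  // Compare one character per iteration until the strings diverge.
  for (var i = 0; i < n; i++) {
    if (text1.charAt(i) != text2.charAt(i)) {
      return i;
    }
  }
  return n;
}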

And I've helpfully graphed the output on a logarithmic scale:
[Graph of linear vs binary performance.]
Any questions?

Update: Wondering how Java compares? bin_vs_lin.java reports that the binary and linear approaches are virtually identical when graphed against each other. Also interesting is what it looks like when graphed on the same scale as the above JavaScript.

Update: Wondering how Python compares? bin_vs_lin.py reports that the binary approach is significantly faster than the linear approach when graphed against each other. The binary approach in Python is about the same speed as either approach in Java.

Update: Wondering how Lua compares? bin_vs_lin.lua reports that the binary approach is significantly faster than the linear approach when graphed against each other.

Update: Wondering how C# compares? bin_vs_lin.cs reports that the linear approach is significantly faster than the binary approach when graphed against each other. Unlike any other language tested so far, C# dies with an out-of-memory exception when generating a string of 1,000,000 characters. Both Microsoft's C# compiler and Mono behaved the same way.

Update: This algorithmic analysis has generated a lot of interest. A Google engineer suggested the fastest algorithm yet:

"A cryptographer would tell you to call text1 == text2 and measure the time it takes to return an answer."
While an engineer from another search engine company suggested the slowest algorithm yet:
// Join the strings with a sentinel character, then let the regex engine
// backtrack to find the longest prefix of text1 that reappears right after it.
var text = text1 + '\x00' + text2;
var m = text.match(/^(.*).*\x00\1/);
var i = m ? m[1].length : 0;
// The greedy backtracking makes this far slower than either approach above.


 