30 June 2009
One of the features of my Diff Match Patch library is that when presented with a computationally intensive diff, the algorithm has the option of bailing out and returning "delete all of text 1, insert all of text 2". The result may be non-minimal, but it is nonetheless a completely valid diff. And usually this is the diff one wants anyway if the texts are unrelated. The library contains a unit test for this case: I set the timeout to 1 millisecond, then diff the opening stanzas of Jabberwocky and the Major General song.
The inevitable happened last week. This test started intermittently "failing" as the new C# version of the library managed to compute the full diff in under the allotted time. My solution was to repeat each of the two input strings 1024 times. This sparked an interesting conversation during the code review about how long this will stave off the inevitable. Moore's law implies that 10 doublings in processing power will have occurred by 2025. The debate centred on whether this would come to pass, and whether massively multi-core CPUs would be useful for a largely serial diff algorithm.
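The arithmetic behind the 1024× multiplier is worth spelling out. Assuming the classic Moore's-law doubling period of roughly 18 months (my assumption, not a precise forecast), the span from mid-2009 to 1 January 2025 allows for about ten doublings:

```python
# Sketch of the Moore's-law reasoning above; the 18-month doubling
# period is the usual rule-of-thumb figure, not a measured value.
DOUBLING_PERIOD_MONTHS = 18

months = (2025 - 2009) * 12 - 6      # mid-2009 to 1 January 2025
doublings = months // DOUBLING_PERIOD_MONTHS
speedup = 2 ** doublings

print(doublings)  # 10 doublings
print(speedup)    # 2**10 = 1024, hence repeating the inputs 1024 times
```

So making the inputs 1024 times longer is calibrated to consume exactly the ten doublings the bet assumes.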
Eventually my friend Tancred Lindholm and I laid down a challenge:
I bet a pizza (or future equivalent) that my diff library will be able to map a 200kb diff in less than 1ms on normal hardware by 1 January 2025.
In other words, I need to speed up the diff by 1000 times over the next 15 years. The biggest factor in the outcome of this bet is of course the performance improvement of future CPUs. Also important are optimizations in future compilers. Both of these I get for free, just by waiting. More interesting are the things within my control. First, I need to re-port the library to C++. The existing C++ port was done by a third party who used the convenient but horrifically slow Qt library instead of the STL. Second, I need to rewrite the mapping algorithm to take advantage of two threads running concurrently on separate cores, then continue the parallelization effort to match CPU design. Finally, I need to keep maintaining the library as technology evolves. If, for example, we switch to quantum processors, I want the diff to continue to run natively, not inside Microsoft Turing Emulator 2025™.
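To give a feel for the two-thread idea, here is one hypothetical shape it could take: if the texts share a known anchor where they can be split, the two halves can be diffed concurrently and the results concatenated. This sketch uses Python's stdlib difflib as a stand-in for the library's real mapping algorithm; finding a good split point (and escaping the GIL in CPython) is the hard part a real implementation would have to solve.

```python
import difflib
from concurrent.futures import ThreadPoolExecutor

def diff_half(a: str, b: str):
    """Diff one pair of halves; difflib stands in for the real mapper."""
    return difflib.SequenceMatcher(None, a, b).get_opcodes()

def parallel_diff(text1: str, text2: str, anchor: str):
    """Split both texts at a shared anchor and diff the halves concurrently.

    Assumes `anchor` occurs in both texts. A real parallel diff would have
    to discover such a split point itself, which is most of the work.
    """
    i, j = text1.index(anchor), text2.index(anchor)
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(diff_half, text1[:i], text2[:j])
        right = pool.submit(diff_half, text1[i:], text2[j:])
        return left.result(), right.result()

left_ops, right_ops = parallel_diff("alpha MID beta", "alphq MID betx", "MID")
```

This only demonstrates the structure; the interesting research is in algorithms whose forward and reverse search fronts can genuinely run on separate cores.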
For reference, here is the offending unit test which I want to start "failing":
dmp.Diff_Timeout = 0.001; // 1ms
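To make the test's logic concrete, here is a toy Python sketch of the bail-out behaviour described above. It is not the Diff Match Patch API; `diff_main` and `slow_minimal_diff` here are hypothetical stand-ins that only illustrate the deadline check and the trivial fallback.

```python
import time

# Stand-ins for the real opening stanzas, repeated 1024 times as described.
JABBERWOCKY = "Twas brillig, and the slithy toves\n" * 1024
MAJOR_GENERAL = "I am the very model of a modern Major-General\n" * 1024

def slow_minimal_diff(a: str, b: str, deadline: float):
    """Stand-in for the expensive minimal-diff search; polls the deadline."""
    for _ in range(len(a)):
        if time.monotonic() > deadline:
            return None  # out of time
    return []  # a real minimal diff would be built here

def diff_main(text1: str, text2: str, timeout_s: float):
    deadline = time.monotonic() + timeout_s
    result = slow_minimal_diff(text1, text2, deadline)
    if result is None:
        # Bail out: non-minimal, but a completely valid diff.
        return [("delete", text1), ("insert", text2)]
    return result

# Whether this bails out or finishes depends on the hardware,
# which is the whole point of the bet.
diffs = diff_main(JABBERWOCKY, MAJOR_GENERAL, 0.001)
```

The test "fails" the moment the search beats the deadline and the trivial fallback is no longer returned.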
Will code for pizza.
Bug swarm: Here are 800 software errors in published code. It's too bad there's no simple way to file bugs on all of those. I noticed this class of error in my own code when a random ID was generated as "aaaaaaaa", which technically is just as random as "d45ukqp4". Obligatory XKCD and Dilbert comics.
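The "aaaaaaaa" point is worth a sketch: every specific 8-character ID drawn uniformly from a 36-symbol alphabet has the same probability, (1/36)^8, so the string alone proves nothing, even though in practice it usually signals a broken or unseeded generator. The generator below is a hypothetical illustration, not the code that produced the bug.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols

def random_id(n: int = 8, rng=random) -> str:
    """Draw an n-character ID uniformly from ALPHABET.

    "aaaaaaaa" and "d45ukqp4" are equally likely: each specific
    ID has probability (1/36)**n.
    """
    return "".join(rng.choice(ALPHABET) for _ in range(n))

print(random_id())
```

A run of identical characters is so salient to humans that we read it as a bug, and with RNG mistakes being this common, that instinct is usually right.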