Neil's News

Fuzzy Match

10 May 2006

The word "elifund" is more likely to be a misspelling of "elephant" than "refund". To make this connection, the Soundex algorithm will hash both "elifund" and "elephant" down to "e415", whereas "refund" becomes "r153". Soundex (and its more advanced cousins) is reasonably good at reconciling words which the author genuinely doesn't know how to spell.

However, it is very poor at matching mistakes due to typos, data corruption, or other non-accoustic errors. Matching something as simple as "telephant" to "elephant" is beyond its abilities. To make this connection, computing the Levenshtein Distance will reveal that those two words differ by only one letter.

The next problem is how to find "telephant" in the string "Is that an elephant in my soup?" Computing the Levensthein distance for each of the 22 possible substrings becomes an O(n3) operation, which would quickly become boring. Fortunately the Bitap algorithm speeds this task up using some very clever bitwise magic.

So far, nothing new. The final problem is how to find "telephant" in the string "Is that an elephant in my soup?" if you have a reasonably good idea that the match should be somewhere around the 8th character. This is the problem encountered by the patch program. It is given a string to look for, and a line number to aim at. If the results don't match then it just has to do the best job it can to find the closest match. Unfortunately patch does not do a particularly good job. I've been working to improve its accuracy.


Fuzzy Pattern:

Fuzzy Location:

< Previous | Next >

Legal yada yada: My views do not necessarily represent those of my employer or my goldfish.