Wednesday, June 11, 2008

Soundex, metaphone, etc...

Recently I was considering fuzzy-search in my application. I needed a simple fuzzy-search on some band names. Although pgsql fuzzy searching is possible I checked out some other possibilities. And I found them.

Soundex
Soundex is an algorithm which tries to connect words which share the same (or very similar) pronunciation. The computed value looks like: R546 or W422. First letter is always the first letter of a word and three digits are generated using simple rules (removing vowels, converting similar consonants to one number). Soundex produces fixed-size four-letter output so it's not very precise (and it's good actually!).

Metaphone
After finding soundex it was only a matter of time to find other approaches one of which is Metaphone. It was created to generate more accurate sounds than Soundex (Metaphone is variable length). Indeed, it generated longer output, but being too accurate is in my case not so good. When fuzzy-searching a band I want to allow relatively big number of typos since band-names are quite unique already.

When it comes to other algorithms, there is Double Metaphone, D--M-Soundex but first two were good enough.

After adding columns to the DB I tried searching the bands and soundex results were better ('better' is quite an odd word here ;P) than metaphone's - the result of soundex being only 4 chars length - fuzzier than metaphone. I had to add some minor fixes (removing weird chars like ":", "the"'s, "der"'s and similar to have the same results for "The Band" and "Band" which seems reasonable).

I'm quite satisfied with the results as I didn't have to add some extensions to postgres (which, by the way, *are* available) and I have the funcionality I needed, simple and clear.

0 comment(s):