Monday, January 24, 2005

More on String Comparisons

Today I added the 'Double metaphone' algorithm to my string comparison toolkit. It's an enhancement to the 'Soundex' function found in many languages today. The main difference in my implementation is that it allows you to determine the maximum length of the key generated.

Using this, I calculate a key for each string, the use the LD algorithm to calc the 'distance' between the keys and get a difference percentage, I then multiply this percentage against the value my LD algorithm creates.

It's had the effect of 'smoothing' the results and making the outcome more predictable. Before, it was possible to have a ‘Percentage of Likeness’ number of 95% or better, yet have the strings be different enough as to be truly different. Now, on those same two strings the number drops to the low 80’s, a number low enough to indicate a human ‘eyeball’ is required.

I’ve also uncovered a new paper on “String Comparison Techniques” that discusses some very sophisticated algorithms to equate the ‘distance’ between two strings. I’ll be talking with the client tomorrow to see how far they want me to take this. I’m hoping they’ll let me run with this, it’s been a fun journey, and I see some very big potential for the knowledge downstream.

2 comments:

Anonymous said...

Hi,

You mention you you found some good informtion on different techniques to compare strings.
For a project I am working on I need to restructure 160 sorting codes, in such a way that when the codes are next to each other, they are visually as different as possible.
Could you point me in the direction of the article you found? It would be a great help, as I think normal edit distance algorithms are not sufficient for this task.

Thanks
Bastiaan (bverhage at gmail.com)

Bill said...

Bastiaan, I sent you an email privately. I hope that helps!