Monday, January 24, 2005

More on String Comparisons

Today I added the 'Double Metaphone' algorithm to my string comparison toolkit. It's an enhancement to the 'Soundex' function found in many languages today. The main difference in my implementation is that it lets you set the maximum length of the key it generates.

Using this, I calculate a key for each string, then use the Levenshtein distance (LD) algorithm to calculate the 'distance' between the keys and get a difference percentage. I then multiply this percentage against the value my LD algorithm creates for the original strings.
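
As a rough sketch of that combination, here it is in Python (not the language of the actual toolkit). Everything named here is an assumption made for illustration: `crude_phonetic_key` is a deliberately simple stand-in and is NOT Double Metaphone, `likeness` normalizes the edit distance by the longer string's length, and `combined_likeness` is a hypothetical name for the multiplied result.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum inserts, deletes and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def likeness(a: str, b: str) -> float:
    """Percentage (0-100) of likeness; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def crude_phonetic_key(text: str, max_len: int = 6) -> str:
    """Hypothetical stand-in for a phonetic key (NOT Double Metaphone).

    Keeps the first letter, drops later vowels plus Y, H and W, collapses
    repeated letters, and truncates the key to max_len characters.
    """
    letters = [c for c in text.upper() if c.isalpha()]
    key = []
    for i, c in enumerate(letters):
        if i > 0 and c in "AEIOUYHW":
            continue
        if key and key[-1] == c:
            continue
        key.append(c)
    return "".join(key[:max_len])


def combined_likeness(a: str, b: str, max_len: int = 6) -> float:
    """Scale the string likeness by the likeness of the two phonetic keys."""
    string_pct = likeness(a, b)
    key_pct = likeness(crude_phonetic_key(a, max_len), crude_phonetic_key(b, max_len))
    return string_pct * key_pct / 100.0
```

With this stand-in key, "Smith" and "Smyth" produce the same key, so their score is left alone, while "Carson" and "Larson" produce different keys and the combined number is pulled down.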

It's had the effect of 'smoothing' the results and making the outcome more predictable. Before, it was possible to get a ‘Percentage of Likeness’ number of 95% or better for two strings that were actually different. Now, on those same two strings, the number drops to the low 80s, low enough to indicate that a human ‘eyeball’ is required.

I’ve also uncovered a new paper on “String Comparison Techniques” that discusses some very sophisticated algorithms for calculating the ‘distance’ between two strings. I’ll be talking with the client tomorrow to see how far they want me to take this. I’m hoping they’ll let me run with it. It’s been a fun journey, and I see some very big potential for the knowledge downstream.

Saturday, January 22, 2005

Levenshtein and other stuff

Another week on the contract, and a busy one at that. We had a big meeting with their larger customers and the concept of 'closely matching' text strings was raised.

The original consensus was that it wasn't 'real world possible'... well, those of you who know me know that's a challenge I couldn't resist!

Long story short, I wrote a Visual FoxPro (as well as a VB.Net) implementation of the Levenshtein algorithm and modified the return value so that it returns what I've been terming the 'Percentage of Likeness' (POL). The user can compare the POL against the level the company has set as acceptable and, if it's high enough, simply accept the string as 'close enough'!
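
For anyone curious what that looks like, here is a minimal sketch in Python rather than Visual FoxPro or VB.Net. The post doesn't spell out how the distance becomes a percentage, so dividing by the length of the longer string is an assumption, and the names `percentage_of_likeness`, `is_close_enough` and the 90% default threshold are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def percentage_of_likeness(a: str, b: str) -> float:
    """Turn the edit distance into a 0-100 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (longest - levenshtein(a, b)) / longest


def is_close_enough(a: str, b: str, threshold: float = 90.0) -> bool:
    """Accept the pair when the POL meets the (hypothetical) company threshold."""
    return percentage_of_likeness(a, b) >= threshold
```

A call such as `is_close_enough("123 Main Street", "123 Main Streat")` compares the POL against the threshold, which is the accept-or-review decision described above.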

Given I returned it to them the morning following the meeting, I got a lot of 'atta-boys' on Friday :) Made for a nice end to a very hectic week.

I'm going to miss this client when the project is finished (end of February at this point). It's been very challenging, yet they leave me alone to actually produce what they've requested, which is not all that common these days. Most projects remind me of the show 'American Hot Rod', where too much is never enough!! This one has high expectations, but delivers the time and resources to actually make it happen!

Next week I'll be sifting through about a quarter of a million customer records using the new algorithm to find items that are potential duplicate entries based on the POL of the address strings... should be very interesting stuff!

-Bill

Thursday, January 06, 2005

All in a day....

Had a meeting today... where they changed the spec on the project... again :)

They added an additional tier of responsibilities... almost like I was an employee. It's funny sometimes how entrenched a contract person can become, even when their time at the company is limited.

Don't get me wrong, I'm flattered by the additional responsibility and the possibility that this could extend the contract further. It's certainly easier to extend the one you have than find another!!

I also made a couple of interesting Visual FoxPro discoveries this week, not the least of which was the ability to programmatically create private data sessions without a form or form set! This has the potential to cut thousands of lines of 'housekeeping code' from applications!! You simply define the private session, open all the tables/DBCs within that session, and when you exit the program you're dropped right back where you were, same table, same position!! Sweet!

I find myself wishing that there was more opportunity on this project to explore the .NET world though. The more I work in that environment, the more impressed and intrigued I am.

I suppose I'll have to keep working on the .NET library and count on the possibility that the next gig will allow me to put it to use!

-Bill

Monday, January 03, 2005

The New Year

Well, it's the start of another new year, and this time I thought I'd actually give this 'blogging' thing a try.

The idea (at least initially) is for me to 'log' the work-related events and projects here, along with the success or failure of the various endeavors, and hopefully learn something as I look back. If my accounts of the software development process help some others along the way, all the better.

As I learn more about this process (blogging) and what this particular site is capable of, I hope to be able to let others share their experiences here as well.

That's it for today, as I'm not working and have some work to do in the garage. I'm rebuilding a '78 Chevy pickup... well, 'disassembling' is a better description of where I am in the process at the moment!!