Subject Re: [IBO] Soundex and SoundexMax...
Author Dany M
Geoff Worboys wrote:
> Hi Dany,
>> If anyone knows, I'd appreciate an answer. Why does
>> SoundexMax deliver fewer hits than Soundex?
> That is one of my inventions... I have been known to use
> misleading names - despite trying to be clear. :-)

I'm not sure *you* mislead me... See below.

> SoundexMax is actually meant to be used in conjunction
> with Soundex - it is not intended to be used on its own.
> This is from the help for TIB_Connection.OnSoundExMaxParse:
> "
> This is NOT a standard SoundEx capability, so if your
> algorithm does not support this simply leave this property
> undefined - and the OnSoundExParse property will be used in
> its place, giving you the same result as usual. If your
> algorithm does support range selection then, if the NOTRAILING
> attribute is defined on a search column the SQL generated will
> select WHERE Fld_SX >= SoundExValue AND Fld_SX <= SoundExMaxValue.
> For more information refer to the TC_SoundEx and TC_SoundExMax
> routines provided in the IB_Parse unit.
> "
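
A minimal sketch of the rule that help text describes, written in Python rather than Delphi. The function name and parameters are hypothetical and are not part of IBO; only the shape of the generated condition is taken from the help text.

```python
from typing import Optional


def soundex_condition(column: str, sx_value: int,
                      sx_max_value: Optional[int] = None) -> str:
    """Build the range condition for a SoundEx search column (hypothetical helper)."""
    # If no "max" algorithm is available, the plain SoundEx value is used
    # in its place, so the range collapses to a single value.
    if sx_max_value is None:
        sx_max_value = sx_value
    return f"{column} >= {sx_value} AND {column} <= {sx_max_value}"


# Values for 'Smith' as quoted elsewhere in this thread.
print(soundex_condition("Fld_SX", 1403092992, 1403097087))
# prints: Fld_SX >= 1403092992 AND Fld_SX <= 1403097087
```

With no max value supplied, the same call yields a range whose two ends are equal, which is the "same result as usual" case the help text mentions.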

I'm positively baffled that there is help on this topic! I usually read
the sources and I only saw the comments on TC_SoundEx and TC_SoundExMax,
not the "event" portion :)

> In actual fact the algorithms mentioned got moved to ib_utils.
> Standard SoundEx (as implemented in FreeUDF) uses only a 16bit
> integer value and represents only 4 or 5 characters from the
> input string. The functions I wrote (TC_SoundEx and
> TC_SoundExMax) extend the algorithm to a 32 bit integer value
> and support 8 or 9 characters from the input string - and this
> allows a more refined search (and better index selectivity in
> large tables).
> SoundExMax comes in when you want to support searching for
> strings that sound like they "start with" some given (brief)
> input string. This lets a user type in only a few characters
> and still get a good result from the more precise 32bit
> processing.
> For example:
> If I type in "Smithsonian" then the 32bit versions return:
> TC_SoundEx = 1403094056
> TC_SoundExMax = 1403094056
> They both give the same value because the SoundEx is getting
> filled from the input string (would be the same even if the
> name were longer - eg: "Smithsonianstuff").
> But what happens if the user only inputs "Smith"?
> TC_SoundEx = 1403092992
> TC_SoundExMax = 1403097087
> See what is happening here? TC_SoundEx is giving you the
> actual (32bit version) SoundEx for "Smith". But as you can
> see looking at the previous example the SoundEx for "Smith"
> does not match the SoundEx for "Smithsonian" and so it would
> not be found. "Smith" does _not_ sound like "Smithsonian",
> so that is to be expected and desired.
> But what if I want to find all names that sound like they start
> with "Smith" (and so want "Smithsonian")? Then you do a
> search like this (automatically built by IBO if you define a
> return from TIB_Connection.OnSoundExMaxParse):
> SOUNDEX_FIELD >= TC_SoundEx('Smith') AND
> SOUNDEX_FIELD <= TC_SoundExMax('Smith')
> This works because SoundExMax calculates the largest possible
> SoundEx result given a smaller input string. Hence the name
> "SoundExMax". The "Max" refers to the soundex value, not to
> the number of results you can expect.
> I found that SoundExMax was necessary for the new 32bit version
> of SoundEx because many names are shorter than 8 or 9 chars,
> and so good results meant using the range select if you wanted
> "starting with" type semantics. The old 16bit soundex did not
> suffer from this very much, because most names are at least 4
> or 5 chars in length - and so the 16bit SoundEx for "Smith" and
> "Smithsonian" are likely to be the same.
> Much of this will be irrelevant if you are using the FreeUDF
> 16bit version of SoundEx - since you will also need to call
> that function from the TIB_Connection.OnSoundExParse event and
> it does not have a "max" variation AFAIK.
> I hope that explains it well enough.
> (I know what I mean ;-)
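
The numbers quoted above can be checked directly. This is only arithmetic on the three values given in the message; it does not reimplement TC_SoundEx or TC_SoundExMax, and the helper name is made up for illustration.

```python
# Values quoted in the message above (not computed here).
SMITH_SX = 1403092992        # TC_SoundEx('Smith')
SMITH_SX_MAX = 1403097087    # TC_SoundExMax('Smith')
SMITHSONIAN_SX = 1403094056  # TC_SoundEx('Smithsonian')


def in_soundex_range(stored: int, low: int, high: int) -> bool:
    """True if a stored SoundEx value falls inside the search range."""
    return low <= stored <= high


exact_match = SMITHSONIAN_SX == SMITH_SX
range_match = in_soundex_range(SMITHSONIAN_SX, SMITH_SX, SMITH_SX_MAX)
span = SMITH_SX_MAX - SMITH_SX

print(exact_match, range_match, span)
# prints: False True 4095
```

An exact comparison misses "Smithsonian", while the range from SoundEx to SoundExMax catches it. For these particular values the range spans 4095 (2**12 - 1), and the low 12 bits of the 'Smith' value are all zero.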

Your explanation is eloquent and leaves nothing to be desired. It makes
perfect sense. Two things got me confused; the first is that I do not
use (and never have used) the built-in search capabilities of IBO, so
NOTRAILING and similar attributes aren't in my vocabulary.

The second is the way Jason included SoundexMax in the IB_FTS utility. It
does all searches using a join, so the BETWEEN concept isn't there. The
FTS creates an index using SoundExMax (would be SOUNDEX_MAX_FIELD in
your example above) and compares it directly to the search term (also
SoundExMax). After reading your explanation, indexing on SoundExMax
seems wrong and also quite unnecessary.
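
To illustrate the concern, here is a small sketch of an equality lookup on SoundExMax, using only the SoundExMax values Geoff quoted. The dictionary is a hypothetical stand-in for the index; it assumes the comparison really is a plain equality join as described here, and it is not taken from the IB_FTS source.

```python
# SoundExMax values quoted earlier in this thread.
SOUNDEX_MAX = {"Smith": 1403097087, "Smithsonian": 1403094056}

# Hypothetical stand-in for an index keyed on the SoundExMax of each stored word.
index = {}
for word, code in SOUNDEX_MAX.items():
    index.setdefault(code, []).append(word)


def lookup_by_equality(term):
    """Return stored words whose SoundExMax equals the term's SoundExMax."""
    return index.get(SOUNDEX_MAX[term], [])


hits = lookup_by_equality("Smith")
print(hits)
# prints: ['Smith']
```

With equality on both sides, a search for "Smith" does not reach "Smithsonian", so the "starting with" benefit of SoundExMax is lost.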

Jason, if you are there and would care to comment?

Regards and thanks,