Subject | Re: [IBO] Soundex and SoundexMax... |
---|---|
Author | Geoff Worboys |
Post date | 2006-09-21T12:53:10Z |
Hi Dany,
> If anyone knows, I'd appreciate an answer. Why does
> SoundexMax deliver fewer hits than Soundex?
That is one of my inventions... I have been known to use
misleading names - despite trying to be clear. :-)
SoundexMax is actually meant to be used in conjunction
with Soundex - it is not intended to be used on its own.
This is from the help for TIB_Connection.OnSoundExMaxParse:
"
This is NOT a standard SoundEx capability, so if your
algorithm does not support this simply leave this property
undefined - and the OnSoundExParse property will be used in
its place, giving you the same result as usual. If your
algorithm does support range selection then, if the NOTRAILING
attribute is defined on a search column the SQL generated will
select WHERE Fld_SX >= SoundExValue AND Fld_SX <= SoundExMaxValue.
For more information refer to the TC_SoundEx and TC_SoundExMax
routines provided in the IB_Parse unit.
"
In actual fact the algorithms mentioned got moved to ib_utils.
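The fallback described in the help text can be pictured with a small sketch. This is Python rather than Delphi, and `soundex_condition` is a hypothetical helper invented here for illustration; it is not IBO's actual SQL generator. It assumes only what the help text says: with no "max" value available, the plain SoundEx value stands in for it.

```python
def soundex_condition(column, sx_value, sx_max_value=None):
    """Build the range condition described in the help text (hypothetical helper).

    If no "max" value is supplied, the plain SoundEx value is used in its
    place, so both bounds are equal and the range matches exactly the rows
    an equality test would match.
    """
    if sx_max_value is None:
        sx_max_value = sx_value
    return f"{column} >= {sx_value} AND {column} <= {sx_max_value}"


# With a max value: a true range select.
print(soundex_condition("Fld_SX", 1403092992, 1403097087))
# Without one: both bounds collapse to the same value.
print(soundex_condition("Fld_SX", 1403092992))
```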
Standard SoundEx (as implemented in FreeUDF) uses only a 16bit
integer value and represents only 4 or 5 characters from the
input string. The functions I wrote (TC_SoundEx and
TC_SoundExMax) extend the algorithm to a 32 bit integer value
and support 8 or 9 characters from the input string - and this
allows a more refined search (and better index selectivity in
large tables).
SoundExMax comes in when you want to support searching for
strings that sound like they "start with" some given (brief)
input string. This lets a user type in only a few characters
and still get a good result from the more precise 32bit
processing.
For example:
If I type in "Smithsonian" then the 32bit versions return:
TC_SoundEx = 1403094056
TC_SoundExMax = 1403094056
They both give the same value because the SoundEx is getting
filled from the input string (would be the same even if the
name were longer - eg: "Smithsonianstuff").
But what happens if the user only inputs "Smith"?
TC_SoundEx = 1403092992
TC_SoundExMax = 1403097087
See what is happening here? TC_SoundEx is giving you the
actual (32bit version) SoundEx for "Smith". But as you can
see looking at the previous example the SoundEx for "Smith"
does not match the SoundEx for "Smithsonian" and so it would
not be found. "Smith" does _not_ sound like "Smithsonian",
so that is to be expected and desired.
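The four values quoted above are consistent with a simple packing: the first letter kept as its ASCII byte, followed by eight 3-bit classic SoundEx codes (vowels, h, w and y coded as 0), with unused slots padded with 0 for the plain value and with 7 for the "max" value. The Python sketch below is reverse-engineered from those numbers only. It is an assumption, not the TC_SoundEx source, and the real routines may differ in details the examples do not exercise (repeated codes, non-letters, case handling). The names `soundex32` and `soundex32_max` are made up for this sketch.

```python
# Classic SoundEx digit classes; letters not listed (vowels, h, w, y) map to 0.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = digit

SLOTS = 8  # number of 3-bit code slots after the first letter (assumed)


def _encode(text, pad):
    """Pack first letter (8 bits) plus SLOTS 3-bit codes into one integer."""
    letters = [c for c in text.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return 0
    value = ord(letters[0].upper())      # first letter stored as its ASCII byte
    tail = letters[1:1 + SLOTS]          # only the next SLOTS letters are used
    for ch in tail:
        value = (value << 3) | CODES.get(ch, 0)
    for _ in range(SLOTS - len(tail)):   # fill the unused slots
        value = (value << 3) | pad
    return value


def soundex32(text):
    return _encode(text, 0)              # smallest value for this prefix


def soundex32_max(text):
    return _encode(text, 7)              # largest value for this prefix


print(soundex32("Smithsonian"), soundex32_max("Smithsonian"))  # 1403094056 1403094056
print(soundex32("Smith"), soundex32_max("Smith"))              # 1403092992 1403097087
```

Under this packing "Smithsonianstuff" gives the same value as "Smithsonian", because only the first nine letters are looked at.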
But what if I want to find all names that sound like they start
with "Smith" (and so want "Smithsonian")? Then you do a
search like this (automatically built by IBO if you define a
return from TIB_Connection.OnSoundExMaxParse):
SELECT ... FROM ATABLE
WHERE SOUNDEX_FIELD >= TC_SoundEx('Smith') AND
SOUNDEX_FIELD <= TC_SoundExMax('Smith')
This works because SoundExMax calculates the largest possible
SoundEx result given a smaller input string. Hence the name
"SoundExMax". The "Max" refers to the soundex value, not to
the number of results you can expect.
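To see the range select at work, here is a small sketch using Python's built-in sqlite3 module as a stand-in for the real database (SQLite is used only to keep the example self-contained). The two "Smith" values are the ones quoted in this post. The third row and its value are arbitrary, made up to have something outside the range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ATABLE (NAME TEXT, SOUNDEX_FIELD INTEGER)")
conn.executemany(
    "INSERT INTO ATABLE VALUES (?, ?)",
    [
        ("Smith", 1403092992),        # value quoted in the post
        ("Smithsonian", 1403094056),  # value quoted in the post
        ("Other", 1242832896),        # arbitrary out-of-range value (made up)
    ],
)
# An index on the integer column lets the range condition use an index scan.
conn.execute("CREATE INDEX ATABLE_SX ON ATABLE (SOUNDEX_FIELD)")

sx, sx_max = 1403092992, 1403097087   # SoundEx and SoundExMax for 'Smith'

exact = [row[0] for row in conn.execute(
    "SELECT NAME FROM ATABLE WHERE SOUNDEX_FIELD = ? ORDER BY NAME", (sx,))]
ranged = [row[0] for row in conn.execute(
    "SELECT NAME FROM ATABLE WHERE SOUNDEX_FIELD >= ? AND SOUNDEX_FIELD <= ? "
    "ORDER BY NAME", (sx, sx_max))]

print(exact)   # ['Smith']
print(ranged)  # ['Smith', 'Smithsonian']
```

The equality test finds only names that sound like "Smith"; the range finds everything that sounds like it starts with "Smith".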
I found that SoundExMax was necessary for the new 32bit version
of SoundEx because many names are shorter than 8 or 9 chars,
and so good results meant using the range select if you wanted
"starting with" type semantics. The old 16bit soundex did not
suffer from this very much, because most names are at least 4
or 5 chars in length - and so the 16bit SoundEx values for "Smith"
and "Smithsonian" are likely to be the same.
Much of this will be irrelevant if you are using the FreeUDF
16bit version of SoundEx - since you will also need to call
that function from the TIB_Connection.OnSoundExParse event and
it does not have a "max" variation AFAIK.
I hope that explains it well enough.
(I know what I mean ;-)
--
Geoff Worboys
Telesis Computing