firebird-support - Re: trying to avoid large datasets

Subject	Re: trying to avoid large datasets
Author	Adam
Post date	2006-07-10T11:53:44Z

> > would i be better off writing it as a udf or a stored procedure?
> >
> > mark
> >
>
> I think it is easier to take one of the existing routines an put it

into an UDF.

>
> What Adam wrote does make sense though, so give it a try on how long

it takes to compare all your

> existing entries with the one just being entered.
> Unfortunately, with the Levensthein Distance one cannot prepare the

existing records but has to make

> the comparison anew whenever another entry is added.
> Maybe you can limit this search to those entries with the same

artist/band?

Perhaps Mark can combine the approaches together. Soundex is good for
generating 'close' matches but does not address the ranking question.

The first step would be to see if the title matches exactly as it was
typed in, if so return that record immediately, forget soundex and
distances.

If you had a stored procedure that tokenised the title into
constituent words, and stored those words with their soundex into a
words table, the soundex field being indexed. Then when someone tries
to add a title, tokenise their title into words, and find all titles
that sound like any of the words typed. At this stage you may want to
have a list of words to not consider ('a', 'and', 'the', etc). This
will hopefully narrow down the list into a couple of hundred records
at worst.

You can now calculate the Levensthein Distance between the total
phrase they type and each candidate title (I would use UDF, and first
check if a pre-written library already contains an implementation),
and order the candidates from the best match down to the worst match
based on the distance. Return the first 10 or so, no point wasting
time with any more.

With a bit of further analysis, you can build a list of words and the
number of titles per word with a pretty simple group by. You can order
by the number of titles which will give you a pretty good indication
as to which records should be ignored.

Depending on the expected size of the average candidate title list,
you could also rank by the number of matched soundex terms. I think
you will find the key here is processing speed. Indexed lookups are
extremely quick and cheap. CPU intensive calculations will hurt you if
you have multiple users doing the same searches at the same time, so
my tip is to use Soundex techniques to narrow your candidates to an
acceptable number, then use Levensthein Distance to work out the rank
of those few candidates, and only return the first 10 or so. You
should get subsecond responses providing your candidate list is not
too big, and your candidate list will only get big if you are not
eliminating common words.

Adam