Subject Re: trying to avoid large datasets
Author Adam
> My first thought was Soundex when I read about your problem.
> However, it's probably of only little or no use for you.
> Creating the soundex for "a river somewhere" returns A6162 whereas
> "river somewhere, a" returns R1625. Those two strings do not seem
> very similar for the soundex

That depends on your implementation. All you need to do is break the
title down into its constituent words, calculate the soundex of each
word, and then do the same for the words of the title passed in.

From this you can rank the titles with the most similar words. You can
quite easily take it even further by adding a table of ignorable words
like 'a' and 'the'.

You just need a table which holds words and their soundex value, and a
table that links the words table to the title table. A stored
procedure can be used for this.

Don't forget as well that in the original massive grid, there would be
no way to realise that 'a river somewhere' was actually in there when
you type in 'river somewhere', because it was 90000 records earlier.

> What you need is the "levenshtein distance". This is an algorithm
> that calculates how many single characters have to be swapped,
> added, shifted or left out to get two strings to be the same.

That would work on small datasets, but there is no way of indexing the
target words. In other words, you would need to do a Levenshtein
distance calculation between the search title and every one of the
120000 records each time, and this may take a few seconds.
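
For comparison, a minimal Python sketch of that approach follows. It assumes the classic Levenshtein definition (insert, delete, substitute one character), which is a little narrower than the description quoted above, and `closest_titles` is a made-up helper name. The point is that every search has to visit every title, because nothing here can be looked up through an index.

```python
def levenshtein(a, b):
    """Classic edit distance: inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute if different
            ))
        previous = current
    return previous[-1]


def closest_titles(query, titles, limit=5):
    """Full scan: one distance calculation per stored title."""
    scored = [(levenshtein(query, title), title) for title in titles]
    return sorted(scored)[:limit]
```

Each call costs roughly len(query) * len(title) steps per record, so the total grows linearly with the number of titles.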

> Ask aunt google for it, it returns a heap of sites.

And according to the news, google is now officially a verb ;)