Subject Re: [Firebird-Architect] Indexing tales - part 1 - Siblings
Author Jim Starkey
Ivan Prenosil wrote:
>> That's not a reasonable thing to ask for (try it with Google, for
>> example). It is reasonable to ask for a root word search ("manch*" for
>> Manchester), that is easy to do.
>>
>
> That's not reasonable to expect that database stores only English words,
> nor that suitability of different searching methods is the same for all languages.
> E.g. because Czech is "wysiwyg" language, SoundEx is absolutely useless
> for me, which does not mean it can't be useful for others :-)
>
My point was that an index designed to support contextual word search
can also handle root searches, not that root searches are an essential
requirement. Root searching works because the "natural" order in the
index is alphabetical. But since the unit is the word, not any
particular ordering, other orderings could be used. Would other
orderings be useful? Beats me. Make a case.
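
To make the root search point concrete, here is a minimal sketch, assuming the index can be modeled as a sorted list of distinct words. The function name `root_search` and the sample words are made up for illustration; a real index would walk a B-tree rather than a Python list, but the idea is the same: find the first word not less than the root, then read forward until the prefix stops matching.

```python
import bisect

def root_search(sorted_words, root):
    """Return every word in an alphabetically sorted list that starts with root."""
    # Binary search for the first entry >= root; all matches are contiguous from here.
    pos = bisect.bisect_left(sorted_words, root)
    matches = []
    while pos < len(sorted_words) and sorted_words[pos].startswith(root):
        matches.append(sorted_words[pos])
        pos += 1
    return matches

# Hypothetical index contents, kept in alphabetical order.
words = sorted(["london", "manchester", "manchu", "mancini", "mandate"])
print(root_search(words, "manch"))
```

The search depends only on the index having some total order over words. A different collation would change which entries are adjacent, and so what a "root" search would return.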
>
>
>> Taking your example further, "and" will almost always be a "stop" word
>> ignored by searches.
>>
>
> Czech translation of "and" is "a". "and" has no meaning in Czech
> so I have no reason to consider it as stop-word.
>
A stop word is a word whose selectivity is too low for useful searching.
Stop words are not only language specific but often application specific. Certain
terms of legal jargon, for example, occur so frequently that while
useful in a phrase, they're useless by themselves.

Netfrastructure started with stop words hard coded. Then it moved to an
inheritance mechanism through an application hierarchy. In the next
version we'll probably have query specific stop words.
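
As a rough illustration of stop words inherited through an application hierarchy, with query specific additions on top, here is a sketch. It is not Netfrastructure code; the `Application` class, `effective_stop_words`, `searchable_terms`, and the sample words are all assumptions made up for the example.

```python
class Application:
    """Hypothetical application node; each one adds stop words to those of its parent."""

    def __init__(self, name, parent=None, stop_words=()):
        self.name = name
        self.parent = parent
        self.stop_words = frozenset(w.lower() for w in stop_words)

    def effective_stop_words(self):
        # Union of this application's stop words and everything inherited from ancestors.
        words = set()
        app = self
        while app is not None:
            words |= app.stop_words
            app = app.parent
        return words


def searchable_terms(app, terms, query_stop_words=()):
    """Drop inherited and query specific stop words from a list of query terms."""
    stops = app.effective_stop_words() | {w.lower() for w in query_stop_words}
    return [t for t in terms if t.lower() not in stops]


# A base application with general stop words, and a legal one that adds jargon.
base = Application("base", stop_words=["and", "the"])
legal = Application("legal", parent=base, stop_words=["whereas"])

print(searchable_terms(legal, ["whereas", "the", "tenant"]))
print(searchable_terms(base, ["whereas", "tenant"]))
```

A Czech application would carry "a" in its own list and never see "and" at all, since nothing requires the lists at any level to be English.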
>
>
>> Finally, things like depluralizing make sense to do in a preprocessing
>> step rather than in the index.
>>
>
> Complexity of preprocessing and reliability of result vary widely for different languages.
> Sometimes it can be better to simply index words exactly as stored in document
> than to let computer "guess" the correct meaning of the indexed word.
>
The preprocessing is on the query, not index, side.
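
A sketch of what that could look like, assuming the index is a plain mapping from each word, exactly as it appears in the document, to a set of document ids. The names `query_variants` and `search` are hypothetical, and the plural rule is a deliberately naive English-only stand-in; any language specific rule, or none, could be plugged in without touching the index.

```python
def query_variants(term):
    """Expand one query term into the forms to look up (naive English plural rule)."""
    term = term.lower()
    variants = {term}
    if term.endswith("s") and len(term) > 3:
        variants.add(term[:-1])   # "cats" also looks up "cat"
    else:
        variants.add(term + "s")  # "cat" also looks up "cats"
    return variants


def search(index, term):
    """Look up every variant of the term; the index itself is never rewritten."""
    hits = set()
    for variant in query_variants(term):
        hits |= index.get(variant, set())
    return hits


# Words are indexed exactly as stored; all guessing happens at query time.
index = {"cat": {1}, "cats": {2}, "dog": {3}}
print(search(index, "cats"))
```

Because the guessing happens at query time, a bad guess costs a few extra lookups and never corrupts what is stored.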