Subject Re: Google-like scoring in databases
Author Roman Rokytskyy
Jim,

> I just implemented a crude mechanism for manually entered stop
> words; an automatic scheme would be nicer. I'd be interested to see
> how Lucene handled the problem.

Lucene indexes documents. A document is a collection of fields, i.e.
name/value pairs, and each field can have three flags set: indexed,
stored and tokenized.
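
For example, indexing one "row" could look roughly like this (a
sketch from memory against the 1.x API; the field names and values
are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// stored + indexed, not tokenized - good for primary keys
doc.add(Field.Keyword("id", "42"));
// stored + indexed + tokenized - goes through the tokenizer chain
doc.add(Field.Text("title", "Google-like scoring in databases"));
// indexed + tokenized, not stored - searchable, but not returned with hits
doc.add(Field.UnStored("body", "full text of the row goes here"));
// stored, not indexed - carried along with the hit, never searched
doc.add(Field.UnIndexed("url", "http://www.example.com/rows/42"));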

If a field is tokenized, it is piped through a tokenizer (aka filter)
before processing, which splits the text into a list of tokens.
Lucene then stores these tokens in the index (afaik, a normal
b-tree). Stemming, stop word filtering, etc. happen within the
tokenizer. Tokenizers can be combined into a chain: one converts all
words to lower case, another strips stop words, the next applies some
stemming algorithm, and so on. Depending on the chain you will get
different behavior.
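
Putting it together, the indexing step is then roughly (again a
sketch; StandardAnalyzer is the bundled chain that lower-cases and
strips English stop words, the index path is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// third argument "true" = create a new index at this path
IndexWriter writer = new IndexWriter("/tmp/my-index", new StandardAnalyzer(), true);
writer.addDocument(doc);  // the document from the previous snippet;
                          // tokenized fields are run through the analyzer here
writer.optimize();
writer.close();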

The same approach is used for searching. In this case, however, the
query is piped through the tokenizers and the result is used for
searching. If you use different chains during indexing and during
search, the result will most likely be incorrect. The result of a
search is a list of document/score pairs. The returned document is
not equivalent to the one that was indexed - only "stored" fields are
returned.

For the English language Lucene provides a Porter stemmer
(org.apache.lucene.analysis.PorterStemmer) and a StopFilter. Here is
the stop word list:

/** An array containing some common English words that are usually not
useful for searching. */
public static final String[] STOP_WORDS = {
"a", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
};
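
To get stop word removal plus stemming you build the chain yourself
in an analyzer, roughly like this (a sketch; the exact class and
constant names differ a bit between Lucene versions):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class EnglishStemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // 1. split on non-letters and lower-case everything
        TokenStream stream = new LowerCaseTokenizer(reader);
        // 2. drop the stop words (the array quoted above)
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        // 3. reduce the remaining words to their Porter stems
        return new PorterStemFilter(stream);
    }
}

You then pass an instance of this analyzer to both IndexWriter and
QueryParser so that indexing and search use the same chain.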

There are also "built-in" analyzers for the German and Russian
languages. Additionally you can find some phonetic algorithms like
Soundex, Metaphone and DoubleMetaphone
(http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html).
Also, some time ago there was a library that allowed ispell
dictionaries (http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html)
to be used with Lucene... I should have it somewhere, but cannot
remember where. I also did not find a plurals filter, but I suspect
that this ispell-enabled filter will handle it.

The following thread discusses the use of stemming for searches:
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@...&msgId=651397

Doug Cutting is Lucene's father and usually posts interesting
information about various search techniques.

Best regards,
Roman Rokytskyy