Subject: Re: Google-like scoring in databases
Author: Roman Rokytskyy
Post date: 2003-07-01T06:11:10Z
Jim,
Lucene indexes documents. A document is a collection of fields, which
are name/value pairs and can have three flags set: indexed, stored and
tokenized.
If a field is tokenized, it is first piped through a tokenizer (a.k.a.
filter) that splits the text into a list of tokens. Lucene then stores
these tokens in the index (AFAIK a normal B-tree). Stemming, stop-word
filtering, etc. happen within the tokenizer. Tokenizers can be combined
into a chain: one converts all words to lower case, another strips stop
words, the next applies some stemming algorithm, and so on. Depending
on the chain you will get different behavior.
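The chained-tokenizer idea can be sketched roughly as follows. This is a toy illustration in Python, not Lucene's actual API; all function names here are made up:

```python
# Toy sketch of a tokenizer/filter chain, loosely modeled on the idea
# described above. NOT Lucene code; every name is illustrative.

def tokenize(text):
    # Split raw text into tokens.
    return text.split()

def lowercase_filter(tokens):
    # Normalize case so "Index" and "index" become the same token.
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "a", "is", "of"})):
    # Strip common words that carry little search value.
    return [t for t in tokens if t not in stop_words]

def analyze(text, chain):
    # Pipe the token stream through each filter in the chain, in order.
    tokens = tokenize(text)
    for f in chain:
        tokens = f(tokens)
    return tokens

chain = [lowercase_filter, stop_filter]
print(analyze("The Index is a B-tree of Tokens", chain))
# ['index', 'b-tree', 'tokens']
```

Swapping, adding, or reordering filters in `chain` is what gives the different behavior mentioned above.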
The same approach is used for searching; in this case the query is
piped through the tokenizers and the result is used for the search. If
you use different chains during indexing and searching, the results
will most likely be incorrect. The result of a search is a list of
document/score pairs. A returned document is not equivalent to the one
that was indexed - only "stored" fields are returned.
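Why mismatched chains produce incorrect results can be seen in a toy example (plain Python, not Lucene): if the index was built with a lowercasing chain but the query term is looked up raw, an exact token lookup simply misses.

```python
# Toy illustration: the same analysis chain must be applied at index
# time and at query time, or lookups miss. Not Lucene code.

def analyze(text):
    # A minimal "chain": tokenize, then lowercase.
    return [t.lower() for t in text.split()]

# Build a tiny term set (stand-in for an inverted index) using the chain.
doc = "Lucene stores Tokens in the Index"
index_terms = set(analyze(doc))

# Raw query term, wrong chain: miss.
print("Tokens" in index_terms)             # False
# Query piped through the SAME chain: hit.
print(analyze("Tokens")[0] in index_terms) # True
```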
For the English language Lucene provides a Porter stemmer
(org.apache.lucene.analysis.PorterStemmer) and a StopFilter. Here are
its stop words:
/** An array containing some common English words that are usually not
    useful for searching. */
public static final String[] STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
};
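To see the effect of that list, here is a quick illustration in plain Python (not Lucene's StopFilter) that drops those words from a lower-cased token stream:

```python
# The English stop words from Lucene's StopFilter, as listed above.
STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
}

def strip_stop_words(text):
    # Tokenize, lowercase, then drop any token found in the stop list.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(strip_stop_words("this is the score of a document"))
# ['score', 'document']
```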
There are also "built-in" analyzers for the German and Russian
languages.
Additionally, you can find some phonetic algorithms like Soundex,
Metaphone and DoubleMetaphone
(http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html).
Some time ago there was also a library that allowed ispell dictionaries
(http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html) to be used
with Lucene... I should have it somewhere, but cannot remember where.
I did not find a plurals filter, but I suspect that the ispell-enabled
filter will handle it.
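For reference, classic Soundex (the simplest of the phonetic algorithms mentioned) fits in a few lines. This is a simplified textbook sketch in Python, ignoring the h/w special case, and not the tangentum phonetix implementation:

```python
# Simplified classic Soundex: keep the first letter, encode the
# remaining consonants as digits, collapse adjacent duplicate codes,
# then pad/truncate the result to 4 characters.

CODES = {}
for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    digits = [CODES.get(ch, "") for ch in word]  # "" for vowels etc.
    result = []
    prev = digits[0]  # a duplicate right after the first letter collapses
    for d in digits[1:]:
        if d and d != prev:
            result.append(d)
        prev = d
    return (first + "".join(result) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Similar-sounding names map to the same code, which is what makes these algorithms useful as fuzzy-matching filters.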
The following thread discusses using stemming for searches:
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@...&msgId=651397
Doug Cutting is Lucene's father and usually posts interesting
information about various search techniques.
Best regards,
Roman Rokytskyy
> I just implemented a crude mechanism for manually entered stop
> words; an automatic scheme would be nicer. I'd be interested to see
> how Lucene handled the problem.