Subject | Re: [Firebird-Architect] Full Text Search |
---|---|
Author | unordained |
Post date | 2005-02-06T04:28:15Z |
Some thoughts:

In the last research paper I saw on how Google "works", there was mention of an upper limit of
4095 on the position of a word within a document; that maximum value was recorded for any word
beyond that point. At least as of that time, multi-word patterns would have been harder to search
for in long documents. The choice of an upper limit is a tough one, if you care about such things.
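
To make the effect of such a cap concrete, here is a minimal sketch in Python. It is not how Google or Firebird does it; the names `positions` and `phrase_match` and the 0-based counting are my own assumptions, and the cap is a parameter so it can be shrunk for demonstration.

```python
import re

MAX_POS = 4095  # assumed cap, taken from the figure quoted above


def positions(text, max_pos=MAX_POS):
    """Map each word to the list of 0-based positions where it occurs.

    Any position beyond max_pos is recorded as max_pos itself.
    """
    index = {}
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        index.setdefault(word, []).append(min(pos, max_pos))
    return index


def phrase_match(index, first, second):
    """True if `second` is recorded at the position right after `first`."""
    after_first = {p + 1 for p in index.get(first, [])}
    return any(p in after_first for p in index.get(second, []))
```

With a tiny cap, `phrase_match(positions("a b c d full text", max_pos=3), "full", "text")` is false, because both words are recorded at position 3; with the default cap the same phrase matches. Everything past the cap collapses onto one position, so adjacency can no longer be checked there.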

Can the "html-aware" feature, if implemented, be based on Yaffil's function-based indexing? That
is, index the result of strip_html(field_name) rather than the field itself? That would make it
more orthogonal, which seems like a good thing to me. (Firebird is getting functional indexing
eventually, right?)
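
As a rough illustration of what "index the result of a function" would mean here, the sketch below is in Python rather than as a database function. `strip_html` is just the hypothetical name used above, and `index_terms` is likewise made up; neither is an existing Firebird or Yaffil function.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Note: text inside <script> and <style> also arrives here.
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps "<p>one</p><p>two</p>" from becoming "onetwo",
    # at the cost of splitting a word that has a tag in the middle of it.
    return " ".join(" ".join(parser.parts).split())


def index_terms(text):
    """Hypothetical indexer: the set of lower-cased words in `text`."""
    return set(re.findall(r"\w+", text.lower()))
```

The indexer never needs to know about HTML: `index_terms(strip_html(field))` sees only `Full text search` for `<p>Full <b>text</b> search</p>`, which is the orthogonality I mean.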

When I helped a normally MySQL-only guy set up full-text indexing of public documents (using
Firebird), I found the whole endeavour frustrating. Firebird worked great; users, however, just
want full-text indexing because they're too lazy to enter good meta-data. Oddly, it's hard to find
documents by date when you're searching only the text of the document and have no idea how the
date might be formatted ("February 1st, 2005"), particularly if your index excludes short words
("1st") and symbols (the slashes, if they're even entering the date 'normally').
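
A small sketch of why dates get lost, assuming a tokenizer that splits on anything non-alphanumeric and drops words shorter than four characters. The function name `tokens` and the length threshold are assumptions for illustration, not what any particular engine does.

```python
import re


def tokens(text, min_len=4):
    """Split on non-alphanumerics and drop words shorter than min_len."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_len]


print(tokens("February 1st, 2005"))  # ['february', '2005']
print(tokens("2/1/2005"))            # ['2005']
print(tokens("02/01/05"))            # []
```

The day is gone in every case, and the last format leaves nothing in the index at all.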

We sorted the results based on how often words occurred, weighting each word based on its rarity in
the overall document set. (We only had to deal with AND and NOT searches.) How would Firebird
return something useful to people who want the results sorted according to other rules? Rarity of
words, number of occurrences, position in the document, relative position to each other in the
document, ... and that's just the tip of the proverbial ice cube.
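
For reference, the kind of ranking we used looks roughly like the Python sketch below. It is a reconstruction under assumptions, not our actual code: the log-based rarity weight and the names `tokenize` and `search` are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def search(docs, required, excluded=()):
    """Rank documents containing every `required` word and no `excluded` word.

    docs maps a document id to its text; query words must be lower-case.
    Score = sum over required words of (occurrences in doc) * (rarity weight).
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)

    def rarity(word):
        # Assumed weight: log(N / number of documents containing the word).
        df = sum(1 for c in counts.values() if word in c)
        return math.log(n_docs / df) if df else 0.0

    weights = {w: rarity(w) for w in required}
    hits = []
    for doc_id, c in counts.items():
        if all(w in c for w in required) and not any(w in c for w in excluded):
            hits.append((sum(c[w] * weights[w] for w in required), doc_id))
    hits.sort(key=lambda h: -h[0])  # best score first; ties keep input order
    return [doc_id for _, doc_id in hits]
```

The scoring rule is baked into `search` here, which is exactly the problem: anyone who wants to rank by position or proximity instead needs access to the raw counts and positions, not just a sorted list of ids.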
-Philip