Subject Re: Google-like scoring in databases
Author Roman Rokytskyy
> ... The page ranking is a second score that attempts
> to weight the results with regard to what other sites thought
> useful...

Which happens to be a hand-made ranking (even if it is made by a vast
number of hands) :)

> Massive number of duplicates is a difficult problem -- search
> refinement is almost always the right solution. Searching the
> IBPhoenix for "blob" or "blobs" is going to get a large number of
> duplicates that are hard to differentiate; "blob seek" or "blob
> subtypes" or "blob filters" will probably do the trick. But
> even in the case of "blobs", a scoring scheme based on the
> references weighted by the inverse of the word number will
> do a very good job of finding general discussion articles.
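
To make the quoted scoring idea concrete, here is a rough sketch of one possible reading of it: each occurrence of a query word in a document counts for 1 / (number of occurrences of that word in the whole collection), so rare words weigh more than common ones. That interpretation of "inverse of the word number" is my assumption, and the names tokenize, build_counts and score are made up for the illustration.

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real indexer would also strip punctuation.
    return text.lower().split()

def build_counts(docs):
    """Return per-document word counts and collection-wide word totals."""
    per_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    totals = Counter()
    for counts in per_doc.values():
        totals.update(counts)
    return per_doc, totals

def score(query, per_doc, totals):
    """Rank documents; each hit is weighted by 1 / collection-wide word count."""
    scores = {}
    for doc_id, counts in per_doc.items():
        s = 0.0
        for word in tokenize(query):
            if totals[word]:
                s += counts[word] / totals[word]
        if s > 0:
            scores[doc_id] = s
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With something like this, a document that contains the rare word "seek" once can outrank one that only repeats the common word "blob" several times.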

And you can change the weight depending on where the search words were
found (for example, a headline match would get more weight than a
content match).
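
A minimal sketch of what I mean by position-dependent weights, assuming each document is split into a "headline" and a "content" field. The field names, the weight values and the function field_score are only illustrative assumptions, not taken from any existing tool.

```python
# Assumed weights: a headline hit counts three times as much as a content hit.
FIELD_WEIGHTS = {"headline": 3.0, "content": 1.0}

def field_score(query, doc, weights=FIELD_WEIGHTS):
    """Sum the query-word hits in each field, scaled by that field's weight."""
    words = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        tokens = doc.get(field, "").lower().split()
        total += weight * sum(tokens.count(w) for w in words)
    return total
```

The same two words found in the headline then score three times higher than when they appear only in the body text; the actual ratio would have to be tuned on real data.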

I think a separate test is needed to see whether we can get pretty good
selectivity/relevance from database indexing using only lexical
information. From my experience of integrating ht://dig with the
CoreMedia content management system, you can get good results without
using references between documents.

Does anybody have suggestions regarding the test database?

Best regards,
Roman Rokytskyy