Subject: Re: Index tales - part 2 - Keyword FTS
Author: Roman Rokytskyy
Hi,

> > For example, issuing a Search for the word 'secret',
> > looking for a product, the stream will return (where? in a table? in
> > an array?) something like:
> >
> > Products.Name: "Lord's secret" (A book, for example)
> > Acc.Description: "My best kept secret account"
> >
> Intelligent search (why have any other kind?) is fast and presents
> the hits in an ordered, concise, descriptive manner. Note that it
> takes less time to Google something than to drill down through most
> application menus.

If you remember, our previous discussion on this topic (a year or two
ago) was about ranking the hits in the database. You presented the
scheme implemented in Netfrastructure, and I showed you an example
where a search for some phrase on www.ibphoenix.com, performed by
Netfrastructure itself, produced worse ranking than the same search
done via Google.

If I'm not mistaken, we agreed that there cannot be anything similar
to Google's PageRank in databases. As a possible extension, an
approach was mentioned that somehow utilizes FK dependencies to rank
the records (see the sketch below). But the traditional ranking that
AltaVista used gives much worse results than Google does, and we
cannot construct a PageRank-like thing for records.
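
Just to make clear what I mean by the FK idea, here is a minimal
sketch (for illustration only, not something we agreed would work):
treat each record as a node and each FK reference as a "link", then
run a PageRank-style power iteration over that graph. All names and
types below are hypothetical, this is not an implementation proposal:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FkRank {

    /**
     * PageRank-style power iteration over a record graph, where an edge
     * A -> B means that record A holds a foreign key pointing at record B.
     * fkRefs must contain an entry (possibly an empty list) for every record.
     */
    public static Map<String, Double> rank(Map<String, List<String>> fkRefs,
                                           int iterations, double damping) {
        Map<String, Double> rank = new HashMap<String, Double>();
        for (String record : fkRefs.keySet()) {
            rank.put(record, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<String, Double>();
            for (String record : fkRefs.keySet()) {
                next.put(record, 1.0 - damping);
            }
            for (String source : fkRefs.keySet()) {
                List<String> targets = fkRefs.get(source);
                if (targets.isEmpty()) {
                    continue;
                }
                // every referenced record gets a share of the referencing record's rank
                double share = damping * rank.get(source).doubleValue() / targets.size();
                for (String target : targets) {
                    next.put(target, next.get(target).doubleValue() + share);
                }
            }
            rank = next;
        }
        return rank;
    }
}

Whether such a score would actually improve the ordering of full-text
hits over plain term statistics is exactly the part I have no good
answer to.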

I am still thinking about this topic, but I have no good idea yet
about how to use web-like FTS in the database.

> And you're postulating that a global
> composite index is slow. Have you noticed that Google with billions
> and billions of indexed documents does searches in milliseconds?
> Stop and think about that. Then drop the nonsense about
> performance.

8 billion documents. And did you notice that Google runs somewhere
between 45,000 and 80,000 nodes? That gives us something between
300,000 and 600,000 documents per node (they usually replicate data
to at least three other hosts)... at an average of 17 KB per
document, that is on the order of 10 GB of text data per node, with
search times between 100 and 700 ms.
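
Just as a back-of-envelope check of those figures (the document
count, node counts, replication factor and average document size are
the ones assumed above):

public class GoogleNapkin {
    public static void main(String[] args) {
        long documents = 8000000000L; // indexed documents
        int replication = 3;          // copies of each document (assumption)
        int avgDocKb = 17;            // average document size in KB (assumption)
        int[] nodeCounts = { 80000, 45000 };

        for (int i = 0; i < nodeCounts.length; i++) {
            long docsPerNode = documents * replication / nodeCounts[i];
            double gbPerNode = docsPerNode * avgDocKb / 1e6; // KB -> GB (decimal)
            System.out.println(nodeCounts[i] + " nodes: " + docsPerNode
                    + " docs per node, ~" + gbPerNode + " GB of text per node");
        }
    }
}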

I think that numbers of this magnitude are achievable. One of the
largest Lucene indices has a size of 87 GB and is reported to be
blazingly fast. And that is running on JDK 1.3.1 with a file-system
based index.
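
For comparison, this is roughly what a keyword search against such a
file-system index looks like with the classic Lucene 1.x API (the
index path and the field names "contents" and "name" are made up for
the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class KeywordSearch {
    public static void main(String[] args) throws Exception {
        // open an existing file-system based index (path is made up)
        IndexSearcher searcher = new IndexSearcher("/data/fts-index");

        // parse the keyword against a hypothetical "contents" field
        Query query = QueryParser.parse("secret", "contents", new StandardAnalyzer());

        // hits come back already ordered by Lucene's TF/IDF-style score
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "  " + hits.doc(i).get("name"));
        }
        searcher.close();
    }
}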

Roman