Subject Re: [Firebird-Architect] Re: Index tales - part 2/3 - CREATE JOIN / Keyword FTS (was Part 2: - Keyword FTS)
Author = m. Th =
Roman Rokytskyy wrote:
>
> Hi,
>
> >> For example, issuing a Search for the word 'secret', looking for
> >> a product the stream will return (where? in a table? in a array?)
> >> something like:
> >>
> >> Products.Name: "Lord's secret" (A book, for example)
> >> Acc.Description: "My best kept secret account"
> >>
> > Intelligent search (why have any other kind?) is fast and presents
> > the hits in an ordered concise, descriptive manner. Note that it
> > takes less time to Google something that to drill down through most
> > application menus.
>

Yes but at least "some" sites has these dark menus and rely heavily on them!
Let's see some examples:
- www.firebirdsql.org
- www.borland.com
- www.netfrastructure.com (!!! :))) ) (sorry, but I couldn't resist...)

> If you remember, our previous discussion on this topic (a year or two
> ago) was about ranking the hits in the database. You have presented
> the scheme implemented in Netfrastructure and I showed you an example
> where search of some phrase on www.ibphoenix.com performed by
> Netfrastructure itself had worser ranking than the same search done
> via Google.
>

One of the very BIG problems of unitary searching the unrelated (and/or
uncorrelated) groups of data.
Other problem is the garbage which is retrieved... But I'm not biased
against the global index which Jim proposed. IMHO, relying on it as the
single alternative isn't the best thing to expect. See bellow.

> If I'm not mistaking, we have agreed that there cannot be anything
> similar to Google's PageRank in databases. As possible extension
> there was mentioned approach that somehow utilizes FK dependencies to
> rank the records. Traditional ranking that AltaVista used provides
> much more worser results compared to Google and we can't construct
> PageRank-like thing for records.
>
> I am still thinking on this topic, but have no good idea about using
> web-like FTS in the database.

Agree. With one correction. "have no good idea about using _ONLY_
web-like FTS in the database". If Jim wants to implement it, so far so
good. I'm not interested so much in _how_ to do it. If Jim proves us
that a composite global index is better, do it. I'm interested in _what_
this feature will do and _when_ it will be ready.
Someone (Jim) can answer to this? Can you implement this in Firebird?
(not only regarding to skills, but mainly to time, changes needed to API
libs on Delphi and C/C++/C# to handle this aso.) When it will be ready?

Another thing. - the CREATE JOIN.

If a global index is ok and easy to implement, then why not have a
global multi-table index (not FTS) for all VARCHAR fields? Or for all
INTEGERS? Or for all DATE fields? (etc.)

IMHO, a more flexible approach would be (exact semantics to be discussed
later) CREATE [KEYWORD] INDEX <name> ON <table1>(<field1>[,
<field2>,...]) [, <table2>(<field3>[, <field4>,...]), ...] to allow
creation on a user-specified multi-table, multi-field index. (In fact, I
prepared this some days ago to be the 'Index tales - part 3', but if Jim
brought the discussion here...). To be honest, I had in mind a CREATE
JOIN syntax with only two tables which would be an b-Tree index to hold
and update the bitmap needed for a join, but if you have time to
develop... let's go to the more general case. This structure will hold
instead of a b-Tree <key> <recno> nodes a b-Tree with <key>
<Table><recno> which will help also in searching (thus slower, IMHO, but
Jim says that's ok) and also in joining (this could be much faster now,
using such a structure). Any comments?

hth,

m. th.

>
> > And you're postulating that a global composite index is slow. Have
> > you noticed that Google with billions and billions of indexed
> > documents does searches in milliseconds? Stop and think about that.
> > Then drop the non-sense about performance.
>
> 8 billions of documents. And did you noticed that Google has between
> 45,000 and 80,000 of nodes? Which gives us something between 300,000
> and 600,000 documents per node (they usually replicate data to at
> least three other hosts)... avg. 17k per document gives us something
> ~10 GB of text data with search times between 100 and 700 ms.
>
> I think that numbers of this magnitude are achievable. One of the
> largest indices in Lucene has size of 87 GB and is reported to be
> blazingly fast. And that is running JDK 1.3.1 and file-system based
> index.
>
> Roman
>
Right. The Google 'big' index isn't so big. The average load/machine
isn't quite high. But in which time frame can you implement all the
things needed to do this (ie. including inter-server replication)? And
BTW, I don't want to compete Google, for the site of my institution I
rather prefer to have them to search my site which has a far better
search engine than I can build in ten years, I only want to bring some
very handful features in a reasonable amount of time to my users
(doesn't matter if their applications are desktop based or web based,
they have FB relational data as back end).
But, again, if you have plenty of development resources why you don't
port and/or adapt to engine Lucene and/or OpenFTS
(http://openfts.sourceforge.net/features.shtml)? These are quite mature
and with many advanced features.

my 2c,

m. th.