Subject Re: Google-like scoring in databases
Author Roman Rokytskyy
> That was the intuitive idea behind blob filters and UDFs. But I
> haven't the foggiest idea of what the engine could do with audio or
> video streams.

Me neither, but DB2 and Oracle have them! ;) The only thing I can
imagine is some advanced application like Napster that keeps track of
all its sound samples and can be queried by sound sample. But that is
from the realm of science fiction.

> >- plugin introduces new RDB$SEARCHABLE_FIELDS;
> >- plugin extends SQL with some new keywords;
> >
> Although some forward looking fellow designed Interbase with user
> extensible system tables, the same dolt neglected to implement
> backup and restore for them, so you have some coding to do.

I think this has to be done anyway. As far as I remember, there were
problems during backup/restore with custom tables that had either the
RDB$ prefix or system_flag set.

> > 2. Somebody needs to maintain the word index for all insert,
> >updates, and deletes of records that contain searchable fields.
> >- auto-generated trigger
> >- plugin extends SQL processing
> >
> >
> The trigger is a piece of cake. The "extends SQL processing" isn't.
> Firebird is still using the crapola YACC parser I cobbled together
> to make a quicker DSQL. Ain't nothing extensible about it. YACC is
> preprocessed at build time, so it's out (probably a very good
> thing). So you need a new parser that loads the grammar at runtime.
> Lots of candidates, but none particularly faster. Since Firebird
> doesn't cache compiled statements, when pumping transactions it's
> often compile bound. Plan to fix that (which needs doing anyway).
> Then there are the various stages of compilation -- semantic
> analysis, boolean distribution, view and computed field expansion --
> that need to be addressed. Major architectural problem here.

The question is where we define the extension point. If we define it
within the parser itself, then yes, we need to address the problems
you described. If we define the parser itself to be an extension
point, then a plugin can replace it completely. How the plugin handles
this is the plugin's problem.

In this case we can keep BLR as the internal language, and parsers
will produce BLR that can be directly executed by the engine. We have
to ensure that BLR itself is extensible.
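
To make the parser-as-extension-point idea a bit more concrete, here
is a minimal sketch in Python (not engine code). Everything in it is
an assumption made up for illustration: the Engine class,
register_parser, the two toy parsers, and the list of (opcode,
operand) pairs that stands in for BLR. Real BLR is a binary format and
looks nothing like this. The only point is that the engine sees the
intermediate form and never the statement text, so a plugin is free to
bring its own parser.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for BLR: a flat list of (opcode, operand) pairs.
Instruction = Tuple[str, str]
Parser = Callable[[str], List[Instruction]]


class Engine:
    """Toy engine: it executes the intermediate form, never the source text."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Parser] = {}

    def register_parser(self, name: str, parser: Parser) -> None:
        # The extension point: a plugin registers a complete parser.
        self._parsers[name] = parser

    def execute(self, parser_name: str, text: str) -> List[str]:
        program = self._parsers[parser_name](text)
        # "Execution" here just renders each instruction as a string.
        return [f"{op}:{arg}" for op, arg in program]


def tiny_sql_parser(text: str) -> List[Instruction]:
    # Understands only: SELECT <column> FROM <table>
    words = text.split()
    if len(words) != 4 or words[0].upper() != "SELECT" or words[2].upper() != "FROM":
        raise ValueError("unsupported statement")
    return [("scan", words[3]), ("project", words[1])]


def fulltext_parser(text: str) -> List[Instruction]:
    # Plugin-supplied syntax, also made up: SEARCH <table> FOR <word>
    words = text.split()
    if len(words) != 4 or words[0].upper() != "SEARCH" or words[2].upper() != "FOR":
        raise ValueError("unsupported statement")
    return [("scan", words[1]), ("match", words[3])]


engine = Engine()
engine.register_parser("sql", tiny_sql_parser)
engine.register_parser("fulltext", fulltext_parser)
print(engine.execute("sql", "SELECT col1 FROM my_table"))
print(engine.execute("fulltext", "SEARCH my_table FOR firebird"))
```

Whether the intermediate form can express what the plugin needs (the
"match" opcode above) is exactly the BLR extensibility question.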

> Designing a plugin API is a piece of cake any time you're willing to
> freeze the internals, which is never, leaving you with a situation
> where you have to distribute a version of a plugin for every minor
> version of the system. If you want to get an idea of how bad this can
> get, look at the Mozilla spell checker. There's a version for 1.3,
> 1.4a, 1.4b, etc. OK, they may be thugs, but they're probably
> morons, but internals must change to make the system better.

We have similar problems with the remote protocol. It has to change to
make the system better. We have to find the right extension point
within the engine.

> >Why cannot we use rdb$db_key? Within one table it seems to be
> > unique.
> >
> Again, a question of architecture. Yes, rdb$db_key is, in fact, the
> table id and record number. But that isn't architecturally
> guaranteed. If you build in a dependence on an implementation
> artifact, it becomes de facto architectural. Maybe Firebird will
> want to switch to 6 byte record numbers in the future. Architecture
> supports it, disks are plenty big enough, but once you've designed
> in a dependency, things get sticky.

Again, we have to find the right solution. I do not know the internals
of Firebird so well, but enforcing the stability of rdb$db_key seems
to be less painful than trying to invent something new.

> >Make optimizer plugin-enabled.
> >
> To summarize my previous comment, NFW.

Ok, then replace the optimizer. :)

> >Why cannot we use SQL here? I know that you do not accept this
> >idea, and Netfrastructure uses a JDBC extension instead. But I
> >think the only thing that has to be transferred to the client in
> >addition to the data in the query is the score, so why cannot we
> >introduce a CURRENT_SCORE pseudo-variable?
> >
> The problem isn't the DML, it's the retrieval mechanism.
> Netfrastructure returns an object of class ResultList that can
> iterate an ordered list of ResultSets. Unless you can find a way
> to do the same within the existing native API, you will lose the
> ability to express the results of a multi-table search, which I
> don't think you want to do.

No, I don't, but I also do not want us to start implementing something
big from scratch, like the Mozilla people did [I wonder why they
didn't start writing their own compiler]. So, multi-table searches
will come after a successful implementation of:

CREATE TABLE my_table(
col1 INTEGER,
col2 VARCHAR(255) NOT NULL,
...
some_text_field TEXT SEARCHABLE
);

...

SELECT col1, col2, col3 FROM my_table
WHERE some_text_field SATISFIES QUERY ? WITH SCORE ?
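
As a rough illustration of what has to sit behind a query like this,
here is a minimal Python sketch of a word index that is kept up to
date by insert, update and delete hooks (the job the auto-generated
trigger would do) and that answers a query with a score. All names
(WordIndex, on_insert and so on) are hypothetical, the integer key is
only a stand-in for whatever record identifier we settle on, and the
score is a plain count of matching words, which is far simpler than
anything Google-like.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Lower-case and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


class WordIndex:
    """Toy inverted index for one searchable field."""

    def __init__(self) -> None:
        # word -> {record key -> occurrences of the word in that record}
        self._postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        # record key -> word counts, so a delete knows what to remove
        self._words_by_key: Dict[int, Counter] = {}

    def on_insert(self, key: int, text: str) -> None:
        counts = Counter(tokenize(text))
        self._words_by_key[key] = counts
        for word, n in counts.items():
            self._postings[word][key] = n

    def on_delete(self, key: int) -> None:
        for word in self._words_by_key.pop(key, {}):
            del self._postings[word][key]
            if not self._postings[word]:
                del self._postings[word]

    def on_update(self, key: int, text: str) -> None:
        # Simplest correct approach: drop the old entries, index the new text.
        self.on_delete(key)
        self.on_insert(key, text)

    def search(self, query: str) -> List[Tuple[int, int]]:
        # Score = total occurrences of the distinct query words in the record.
        scores: Counter = Counter()
        for word in set(tokenize(query)):
            for key, n in self._postings.get(word, {}).items():
                scores[key] += n
        # Highest score first; ties broken by key for a stable order.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = WordIndex()
index.on_insert(1, "Firebird is a database")
index.on_insert(2, "Database engines index database pages")
print(index.search("database"))
```

The (key, score) pairs returned by search are what the WITH SCORE
parameter would have to carry back to the client for a single table.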

Best regards,
Roman Rokytskyy