firebird-architect - RFC: Cross database queries

Subject	RFC: Cross database queries
Author	Vlad Horsun
Post date	2007-08-01T10:26:22Z

Below is draft version of proposal of how we may implement cross database
queries. It is not a truth at a last instance, it is incomplete and may contain
bugs (don't register them at tracker, please ;))

All opinions, additions etc are welcome.
Thanks

1. Registration of an external source of the data

CREATE | ALTER EXTERNAL DATA SOURCE
<ds_name>
PROVIDER <provider_name>
CONNECT <ds_connection_string>
[AS {USER <external_user> [PASSWORD <external_password>] | CURRENT_USER}]
[OPTIONS < ds_options >]

DROP EXTERNAL DATA SOURCE
<ds_name>

Where:

<ds_name>: string - name of an external data source, unique in database.
Must be valid Firebird identifier

PROVIDER <provider_name>: string - identifier of plugin realizing work with an external
data source. Most likely - a name of dll, by analogy with UDF

CONNECT <ds_connection_string>: connection string, understandable to provider

(connection string should not contain user name\password, else will be not possible to
correctly organize pool of connections to external server, based on users names ?)

AS : defines external login for work with a data source :
а) USER <external_user> [PASSWORD <external_password>], or
b) CURRENT_USER - for connection with others Firebird data sources using
authentication common with current database

OPTIONS <ds_options>: parameters of interaction between an engine and provider
(external connections pool size, connection timeout, etc), a string of a format
"param=value;..."
The list of parameters will be specified later

The list of external data sources is stored in local DB in the new system table(s)

2. Login mappings management

CREATE | ALTER EXTERNAL USER
ON <ds_name>
{AS < ext_user_name > [PASSWORD <ext_password>] | CURRENT_USER}
{FOR < int_user_name > | CURRENT_USER | ALL}

DROP EXTERNAL USER
ON <ds_name>
{FOR < int_user_name > | CURRENT_USER | ALL}

Where:

ON <ds_name>: a name of an external data source for which this mapping is defined

AS <ext_user_name> [PASSWORD < ext_password >]: a name and the password of the
user understandable to an external source

AS CURRENT_USER : the external user's name is equal to a local user's name. The mechanism
of such authorization is in doubt. Probably it will be possible to implement between
two Firebird engines, or between two DB on the same engine using common security
database (vulcan)

FOR <int_user_name>: for which local user this mapping is defined

FOR CURRENT_USER - this mapping is defined for current local user

FOR ALL - this mapping is defined for all local users for which it not defined explicitly
for the given external data source

For every local user (and for ALL) possible to set no more than one mapping within each
external data source.

The explicitly defined mapping redefines what is set in external data source definition
(see CONNECT AS in CREATE EXTERNAL DATA SOURSE statement)

The mappings stored in local DB in the new system table (not in security database as
mappings depends on locally defined <ds_name>)

3. Direct (pass-through) queries to external data source (not heterogeneous)

[FOR] EXECUTE STATEMENT
<stmt_sql>
[(<in_param_list>)]
[INTO < var_list >]

[ON EXTERNAL [DATA SOURSE] < ds_name >]

[WITHIN [AUTONOMOUS | COMMON [2PC]] TRANSACTION]

Here:

ON EXTERNAL [DATA SOURSE] <ds_name> : defines an external data source to which query will
be passed for execution. It may be literal or variable

WITHIN [AUTONOMOUS | COMMON [2PC]] TRANSACTION : defines transaction behaviour (see below)

Connection with an external data source is established at the moment of query preparing
and\or taken from connection pool. Pool of external connections common for an engine, suitable
connection is defined by a data source name and an external user name (using mapping of local
user name on external user name). It is necessary to set timeout for end of idle connections.

(Parameters POOL_SIZE and POOL_IDLE_TIMEOUT ?)

External queries binds to local connection and all open queries are closed when local
connection terminates.

External connection closed automatically if there are no more open (executed) statements
(or puts back in pool), or after some timeout expiration. In the second case all open external
transactions are rolled back, queries are closed.

4. Heterogeneous queries to external data source

Table qualifier: <ds_table_name>@<ds_name>

It is necessary to teach parser and optimizer to cut from heterogeneous query such parts
which may be sent to an external engine. Also it is necessary to demand ours SQL syntax and to
decide where to handle built-in functions and aggregates. Not clear how to handle joins of local
and external tables - to put the external table at outmost place (and fetch all rows, but only
once) or to put it at inner place (and to execute set of parametrized queries on external data
source for every row fetched from local table)

Here i see 2 possible strategies of work :

а) Distinguish each external table, to request its format, cardinality, indexes and their
selectivity and to remember this information similarly to ours jrd_rel. It will do DSQL in pass1.
GEN will build usual rse with blr_remote_relation instead of blr_relation. Further work with
these remote_rel as with usual tables - build rsb for each of them, etc. CMP will build
parametrized queries for external data sources from these rsb's

Pluses: relative simplicity, possibility to not touch optimizer.

Minuses: not optimal plans if more than one external tables from the same data source
participate in query since they will be processed separately and joined locally though frequently
it is possible to handle on an external server. Aggregation too, most likely, will be handled
locally (?). Functions and non-standard aggregates (known to external engine) can not participate
in such queries.

b) DSQL will distinguish external tables from the same data source and do decomposition of
heterogeneous query on some "monogenic" queries. For each such external query will build rse same
as for EXECUTE STATEMENT (with in\out parameters)

It is necessary to think up a way to set an external table name with its full qualification by
external engine rules. it is possible to quote full name but it is easy may not fit in ours 32
symbols and it is not clear how to quote external quoted identifiers;)

Also such names will not fit in SQLDA. The truth, it not problems for work with external FB
data source.

5. Transactions

Provider should know how to map Firebird's transaction parameters on target server transaction
parameters. Engine will start external transaction on demand. If the external data source does not
support a some simultaneous transactions in one connection then at start of second transaction
either the error returned, or new connection created. (external data source settings ?)

Similarly with several cursors in one transaction\connection. When opening second cursor
on external data source we can :

a) Return error
b) Full fetch and cache first cursor then close it
c) Open new cursor in new connection and transaction

Lifetime of external transaction is defined by WITHIN clause (or sets default behaviour in
external data source settings) and may be :

a) AUTONOMOUS TRANSACTION
The external query will execute each time in its own new external transaction which is
completed together with external query - at its completion, EOF, error state of an external
query, completion of local query.
Each external query within the local transaction executed in separate external transaction
//This mechanism (standalone\autonomous transactions) can be used for local EXEC STMT too

b) WITHIN COMMON 2PC TRANSACTION
If the external data source supports 2PC external transactions it is joined with local
transaction. It is possible to do for IB\FB and, probably, involving the external transaction
manager such as MTS.
If it is not possible then query is not executed raising error. Subsequent external queries
which belongs to the same external data source in the same local transaction are executed in
context of available external transaction. External 2PC transaction completed together with
local transaction (not when query finished).

c) WITHIN COMMON TRANSACTION
If the external data source supports 2PC transactions it is joined with local one (as
above). If it is not possible, then lifetime of external transaction is bound to lifetime of
local transaction. External transaction completed by the same way, as local one, before local
completes. If completion of external transaction failed then error returned and engine "forgets"
about this external transaction. The user then may commit local transaction and it will be
completed without error

5a. The mechanism of 2PC transactions :

When the engine initiates new external transaction, it passed to the provider handle of local
transaction (omit what is it, for now) and remembers it. The provider himself joined external and
local transactions (omit how to do it, for now) and returns to an engine handle of new distributed
transaction. From this moment the provider is the coordinator of 2PC transaction. Certainly it may
delegate this operation to the external coordinator, such as MS DTC.

Handle of distributed transaction saved with local transaction. From this moment isc_commit,
isc_rollback and isc_prepare on client handle of transaction (this operation the client will call
eventually) are delegated to the provider. Then provider call (without WHY) isc_commit\isc_rollback\
isc_prepare on its handle of local transaction (passed to it right at the beginning). This
operation engine handles as usual - i.e. on local part of transaction.

Processing isc_prepare on client handle of local transaction depends on the ability of provider
to perform this operation for distributed transactio. The Firebird provider can make it, others -
not sure.

Thus operations of both client and engine on distributed transaction are delegated to the
provider and it has mechanism of how to call our API for a local part of whole distributed
transaction.

To support this functional by the Firebird provider it is necessary to implement mechanism of
joining existing transaction with additional :

isc_join_transaction (HANDLE tra_handle, HANDLE joined_tra_handle, HANDLE joined_db_handle)

It should be easy enough.

5b. Savepoints

If provider does not support savepoints we may :

a) prohibit everything, except for read-only queries (how to identify in foreign (unknown)
SQL syntax?)

b) don't not execute savepoints on an external data source at all (what to if local transaction
rolled back to intermediate savepoint ?)

If provider supports savepoints, then if under our internal savepoint external query was executed
then before execution of this query engine will start savepoint on external data source and local
savepoint is flagged. The further operations with local savepoint will be executed with external
savepoint too.

6. Lifetime of external queries :

a) During prepare of local query with EXECUTE STATEMENT... ON... engine build jrd_node with
appropriate parameters and allocated place for the reference to ExecuteStatement class in impure
area (i.e. as well as now plus name of a data source and transaction parameters). Formats of input
and output parameters are defined by appropriate variables.

Separate question - EXECUTE STATEMENT not in PSQL block. In this case formats message's for
client can be defined only when the text of query is known, i.e. during call isc_dsql_prepare the
text of query should be known (may not be a parameter).

For those who wants prepared reusable EXECUTE STATEMENT with parametrized text of query - will
wrap it up in PSQL, for example in EXECUTE BLOCK.
However for now EXECUTE STATEMENT may not be used outside of PSQL block ;)

b) Opening of external RSE:
- Search in a pool or create a new external connection
- Start new or a reuse existing external transaction
- Open external query

c) Fetch from external RSE:
- As well as now

d) Closing external RSE:
- Closing external query
- commit AUTONOMOUS transaction
- return external connection into a pool

7. Provider API :

The base idea is to use some well known or well designed API and adapt it to use with architecture
above. I can think about two : ISC API (then we may include provider into common Y-valve) or ODBC API
(it allow to easy implement external data source for ODBC). No details are investigated yet.

Regards,
Vlad