Subject | Applications of Encoded Data Streams |
---|---|
Author | Jim Starkey |
Post date | 2005-05-02T14:50:35Z |
The self-describing, platform-independent data stream encoding schemes
that we've been discussing have at least three interesting potential
applications within future versions of Firebird.
The first, as described earlier, is the message format used by the lower
layers of a new API and plumbing to move data around. The model is that
the engine and client sides encode and transmit data in its
"original" form, with any necessary conversions performed in the data
consumer's context. This means that a client can set data in any type
it wishes, the data will be encoded and transmitted in that type,
and if the engine needs the data in a different type, the engine
performs the conversion. The one exception is character encoding: all
character data is encoded in utf-8 Unicode. For reference, the client
side of DSQL uses information from "prepare describing" to perform data
conversions on the client side before sending the data, and character data
retains its original encoding. The goal of encoded data streams is to
eliminate intermediate data copy and conversion steps, reducing the
overhead of the plumbing.
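To make the model concrete, here is a minimal sketch in Java of what a
self-describing encoded value might look like: a one-byte type tag
followed by a payload. The tag values, method names, and fixed-width
length field are illustrative assumptions on my part, not the actual
Firebird encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: each value is written as a one-byte type
// tag followed by a payload, so the consumer can decode it and convert
// only if the received type differs from the type it needs.
public class StreamSketch {
    static final byte TAG_INT32 = 1;   // assumed tag values, not Firebird's
    static final byte TAG_UTF8  = 2;

    static void putInt32(ByteArrayOutputStream out, int v) {
        out.write(TAG_INT32);
        putBigEndian(out, v);
    }

    static void putString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // always utf-8
        out.write(TAG_UTF8);
        putBigEndian(out, utf8.length);  // a real scheme would compress this
        out.write(utf8, 0, utf8.length);
    }

    static void putBigEndian(ByteArrayOutputStream out, int v) {
        for (int shift = 24; shift >= 0; shift -= 8)
            out.write((v >> shift) & 0xFF);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The client sets data in whatever type it has on hand...
        putInt32(out, 42);
        putString(out, "Grüße");   // multi-byte utf-8 travels unchanged
        System.out.println("encoded length: " + out.size() + " bytes");
        // ...and the engine, if it needs a different type, converts
        // after decoding the tagged value, in its own context.
    }
}
```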
The second application would be to support "getRecord" and "setRecord"
methods in the higher layers of the new API. The basic idea is that an
application could do a single getRecord call to get a complete row from a
ResultSet that could later be used to "setRecord" to a prepared insert
statement with the same field list. This mechanism is ideal for a
future version of gbak or a similar backup/restore utility. Another
natural use is in replication.
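As a sketch of how that higher layer might read, here is a hypothetical
pair of interfaces and a copy loop. None of these names exist in any
released Firebird API; they only illustrate the whole-row get/set
pattern described above.

```java
// Hypothetical interfaces sketching the proposed whole-row API; the
// names EncodedRecord, RecordSource, and RecordSink are stand-ins.
interface EncodedRecord {}                 // an opaque, already-encoded row

interface RecordSource {                   // stands in for a ResultSet
    boolean next();
    EncodedRecord getRecord();             // fetch the complete current row
}

interface RecordSink {                     // stands in for a prepared insert
    void setRecord(EncodedRecord record);  // supply the complete row at once
    void execute();
}

public class CopyRows {
    // Copy every row from a select to an insert with the same field
    // list: no per-column get/set calls and no intermediate type
    // conversion, since the encoded record travels as-is.
    static void copy(RecordSource from, RecordSink to) {
        while (from.next()) {
            to.setRecord(from.getRecord());
            to.execute();
        }
    }
}
```

A backup utility would be this same loop writing records to a file
instead of a prepared insert, and replication would be the same loop
pointed at another database.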
The third application of encoded data streams is as a record encoding in
the on-disk structure. There is some appeal to an increase in record
density, but the main benefit would be to eliminate physical string
length restrictions within the engine. In the current ODS, strings are
stored within a record in a fixed length field of some declared size.
This worked OK for ASCII, but multi-byte characters, in general, and
utf-8, in particular, make this problematic. Use of encoded data streams
would divorce the physical storage from the logical declaration,
reducing the declared field length to a constraint on the number of
characters, not bytes, of a string. It would also allow us to implement
an unconstrained length string type, as in Java, deprecating the concept of
fixed length strings as an unfortunate legacy of punched cards.
Adoption of encoded data streams within the ODS, however, is predicated
on switching the engine from multi-character set to purely utf-8.
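A small example of why the character/byte distinction matters once
storage is utf-8 (the string literal is just an illustration):

```java
import java.nio.charset.StandardCharsets;

// A character-count constraint and a byte limit diverge as soon as
// strings are stored as utf-8 rather than single-byte ASCII.
public class CharVsBytes {
    public static void main(String[] args) {
        String s = "Grüße";   // 5 characters, but not 5 bytes in utf-8
        int chars = s.codePointCount(0, s.length());
        int bytes = s.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(chars + " characters, " + bytes + " bytes");
        // A declared length of 5 would accept this string as a
        // character-count constraint, but a fixed 5-byte field could
        // not hold its 7-byte utf-8 encoding.
    }
}
```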
Personally, I think the ideas of arbitrary string handling, simplified
international character set support, and a quantum leap in plumbing
efficiency are pretty exciting. The fact that it would also reduce code
size and complexity is just icing on the cake.
--
Jim Starkey
Netfrastructure, Inc.
978 526-1376