firebird-architect - Re: [Firebird-Architect] Data Stream Encoding

Subject	Re: [Firebird-Architect] Data Stream Encoding
Author	Geoff Worboys
Post date	2005-04-30T03:01:22Z

Jim Starkey wrote:

> Geoff Worboys wrote:
>>
>>On second thought you have presumably considered such schemes
>>and discarded them for some reason. Still I think the code
>>would be cleaner without such a huge enumeration.
>>
> Sure, there's method to my madness. Consider:

...

> Yes, there are a huge number of cases. But they fall into a
> relatively small number of classes, each of which can be
> handled with a small, efficient piece of code.

I would argue the point about "small". Looking at your enum,
we are talking about 100-170 lines JUST for the case labels,
before any of the code that actually does the work. I really
hate such huge switch statements.

While it may be ostensibly simple code, it is a pain to read,
and maintainers invariably start sticking too much code within
each section rather than bother to set up new functions.

I have never tested just how efficient the code is when using
such a huge number of switch elements. I guess it will not be
very slow, but I do wonder whether early isolation would be at
least as fast.

To my mind it would be better to immediately identify the major
encodings and direct to specific functions.

> In practice, an encoded data stream will always be prefaced
> by a version number. At the expense of maintaining a variety
> of different historical encodings, we can tweak the sizes of
> the various classes as we gain experience.

I guess my main point was that, by isolating the major types
early, any changes needed may be isolated to the relevant
functions. Thus if a change only affects one encoding, the
others remain in their own functions, ready to be called by
whatever versions still use them.

I mentioned in my previous post that other "clever" options
exist when using bit fields. Below is a description of such.
It is obviously more complicated (to explain) than my original
suggestion but still quite simple and fast to code and process.
Its primary advantage lies in expanding the capabilities of the
integer, utf8 and opaque data types to come closer to your
large enumeration.

2-bits for major type grouping (how to interpret next 6 bits)
    0 = other types
    1 = integer types
    2 = utf8 types
    3 = opaque types

integer types:
    6-bits of integral value
        -32 = actual integer value follows (encoding info
              embedded in the (scale?) byte that follows)
        -31..+31 = actual integral value (unscaled) - or any
              other range of 63 values you want, just define an
              appropriate offset.
        OR you could reserve a series of values for use as
        an enumeration that describes the following encoding.
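
As a rough sketch of the integer case only: the code below
assumes the 6 bits hold a two's-complement value, which gives
exactly the -32..+31 range described above. The names
(kMajorInteger, kValueFollows, inlineValue, makeIntegerTag,
hasInlineValue) are made up for illustration and are not part
of any existing code; an offset-based mapping would work just
as well.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the description above fixes only the bit layout.
const int kMajorInteger = 1;    // value of the top 2 bits for integer types
const int kValueFollows = -32;  // escape: the real value is in later bytes

// Sign-extend the low 6 bits of a tag byte (two's complement assumed).
inline int inlineValue(std::uint8_t tag)
{
    int v = tag & 0x3F;               // bits 0..5
    return (v & 0x20) ? v - 64 : v;   // bit 5 acts as the sign bit
}

// Build the tag byte for a value small enough to be stored in place.
inline std::uint8_t makeIntegerTag(int value)
{
    assert(value >= -31 && value <= 31);
    const unsigned low6 = static_cast<unsigned>(value) & 0x3Fu;
    const unsigned major = static_cast<unsigned>(kMajorInteger) << 6;
    return static_cast<std::uint8_t>(major | low6);
}

// True when the tag carries the value itself; false when the value
// must be read from the bytes that follow the tag.
inline bool hasInlineValue(std::uint8_t tag)
{
    assert((tag >> 6) == kMajorInteger);
    return inlineValue(tag) != kValueFollows;
}
```

With this mapping makeIntegerTag(5) is the single byte 0x45,
and the byte 0x60 is the "value follows" escape.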

utf8 and opaque types:
    6-bits of size information
        0..62 = actual size of following string/binary
        63 = indicates that count bytes follow
              count-bytes use only 7 bits; on the last count byte
              in the sequence the 8th bit is set to indicate end
              of count-bytes (or vice versa)
        OR you could reserve a number of values and say that
        anything over 50 (for example) indicates that (x - 50)
        count bytes exist.
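
A minimal sketch of reading such a length, covering only the
first variant (63 = count bytes follow). Two things here are
my assumptions rather than part of the description: the 7-bit
groups arrive most significant first, and the 8th bit marks
the LAST count byte. The function name decodeLength is also
just illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decode the length of a utf8/opaque item whose tag byte is p[0].
// Returns the number of header bytes consumed (tag plus count bytes),
// or 0 if the input is truncated or the count does not fit in 64 bits.
inline std::size_t decodeLength(const std::uint8_t* p, std::size_t avail,
                                std::uint64_t& length)
{
    assert(p != 0);
    if (avail == 0)
        return 0;

    const unsigned size6 = p[0] & 0x3Fu;
    if (size6 != 63) {              // 0..62: the size is stored in place
        length = size6;
        return 1;
    }

    std::uint64_t n = 0;
    for (std::size_t i = 1; i < avail; ++i) {
        if (n >> 57)                // the next shift would lose bits
            return 0;
        n = (n << 7) | (p[i] & 0x7Fu);
        if (p[i] & 0x80u) {         // 8th bit set: last count byte
            length = n;
            return i + 1;
        }
    }
    return 0;                       // ran out of input before the end marker
}
```

Strings of up to 62 bytes cost a single header byte; a
128-byte string would cost three (tag, 0x01, 0x80) under
these assumptions.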

other types:
    3-bits
        0 = specific value (no data follows)
            3-bits
                0 = null
                1 = boolean true
                2 = boolean false
                3 = ?

        1 = date (# of days since ?)
            3-bits = date encoding enum (rather than size)

        2 = time (# of millis since ?)
            3-bits = time encoding enum (rather than size)

        3 = float
            3-bits = encoding enum (rather than size)

        4..7 = expansion
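
To illustrate the "other" byte, here is a small sketch that
splits it into its two 3-bit fields. I have assumed the
sub-type sits in bits 3..5 and the detail in bits 0..2; the
description above does not fix that order. All of the names
(OtherKind, SpecificValue, OtherTag, splitOtherTag, asBoolean)
are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the sub-types listed above.
enum OtherKind : std::uint8_t {
    otherSpecific = 0, otherDate = 1, otherTime = 2, otherFloat = 3
};
enum SpecificValue : std::uint8_t {
    valueNull = 0, valueTrue = 1, valueFalse = 2
};

struct OtherTag {
    std::uint8_t kind;    // bits 3..5: which "other" sub-type
    std::uint8_t detail;  // bits 0..2: specific value or encoding enum
};

// Split a tag byte whose top 2 bits are 0 ("other types").
inline OtherTag splitOtherTag(std::uint8_t tag)
{
    assert((tag >> 6) == 0);
    OtherTag t;
    t.kind = static_cast<std::uint8_t>((tag >> 3) & 0x07);
    t.detail = static_cast<std::uint8_t>(tag & 0x07);
    return t;
}

// Returns true and sets 'value' only for the two boolean codes.
inline bool asBoolean(const OtherTag& t, bool& value)
{
    if (t.kind != otherSpecific)
        return false;
    if (t.detail == valueTrue)  { value = true;  return true; }
    if (t.detail == valueFalse) { value = false; return true; }
    return false;
}
```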

Side note: I prefer to have boolean true/false exist directly
(rather than reverting to integer) because it is nice to be
explicit where feasible to do so.

Side note 2: The 6-bit values for the integer, utf8 and opaque
types can be seen as 64-item enumerations: you can map part of
the range to in-place values and the rest to encoding
descriptions. So they remain quite flexible and, if done
carefully, remain extensible if you decide to split the
encoding into separate bytes later. (By which I mean that the
detail functions could remain unchanged, with only the
top level of the decoder needing maintenance.)

switch(code / 64)
{
    case edsInteger:
        // integer handling function
        break;
    case edsUtf8:
        // string handling function
        break;
    case edsOpaque:
        // opaque handling function
        break;
    case edsOther:
        // "other" handling function or another switch here
        break;
}

The integer, utf8 and opaque cases could immediately jump to
their specific functions, passing just the relevant 6 bits of
integral value. The "other" handling could go directly to
another function or be split in place with another (still
quite small) switch statement.
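
Filled out a little, the top level might look like the sketch
below. The handlers are placeholders that only report what
they were given; handleInteger, handleUtf8, handleOpaque,
handleOther and dispatch are names I have invented for the
example, and the enum values follow the 2-bit grouping listed
earlier.

```cpp
#include <cstdint>
#include <string>

enum MajorType { edsOther = 0, edsInteger = 1, edsUtf8 = 2, edsOpaque = 3 };

// Placeholder handlers: each receives only the low 6 bits of the tag.
// Real handlers would go on to read the data that follows.
inline std::string handleInteger(unsigned low6) { return "integer:" + std::to_string(low6); }
inline std::string handleUtf8(unsigned low6)    { return "utf8:" + std::to_string(low6); }
inline std::string handleOpaque(unsigned low6)  { return "opaque:" + std::to_string(low6); }
inline std::string handleOther(unsigned low6)   { return "other:" + std::to_string(low6); }

// Top level of the decoder: isolate the major type, then hand off.
inline std::string dispatch(std::uint8_t code)
{
    const unsigned low6 = code & 0x3Fu;
    switch (code >> 6) {            // same as code / 64 for an unsigned byte
    case edsInteger: return handleInteger(low6);
    case edsUtf8:    return handleUtf8(low6);
    case edsOpaque:  return handleOpaque(low6);
    case edsOther:   return handleOther(low6);
    }
    return std::string();           // not reached: 2 bits give only 4 values
}
```

A change to, say, the string length encoding would then touch
only the utf8 and opaque handlers, and the switch itself stays
at four cases.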

Small, modular and I suspect just as fast as a huge switch.
There are portability issues with bit twiddling, but there are
various ways to avoid the problems.
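
One such way, assuming the concern is the usual pair of
pitfalls: the layout of C/C++ bit-field structs is
implementation-defined, and plain char may be signed, in
which case dividing a byte of 0x80 or above by 64 does not
yield 2 or 3. Explicit shifts and masks on an unsigned value
sidestep both. The helper names majorType and low6 are
illustrative only.

```cpp
// Split a raw stream byte via an unsigned intermediate, so the result
// does not depend on whether plain char is signed on the platform.
inline unsigned majorType(char raw)
{
    const unsigned byte = static_cast<unsigned char>(raw);
    return byte >> 6;       // top 2 bits: major type grouping
}

inline unsigned low6(char raw)
{
    const unsigned byte = static_cast<unsigned char>(raw);
    return byte & 0x3Fu;    // low 6 bits: value, size or sub-type
}
```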

There are many variations possible; the above is just a
second-pass guess at what may work.

--
Geoff Worboys
Telesis Computing