Subject: Re: [Firebird-Architect] Data Stream Encoding
Author: Geoff Worboys
Jim,

I did say that my suggestion was just a second-pass guess at
what could work. There is considerable latitude...

> const UCHAR* EncodedDataStream::altDecode(const UCHAR *ptr, Value
> *value)
> {
> const UCHAR *p = ptr;
> UCHAR code = *p++;

> switch(code >> 6)
> {
> case 1: // integer
> {
        short n = (code & 0x3F);
        if (n < 4)
        {
            // n = 0..3: one sign-carrying byte plus n more (1..4 bytes)
            int32 val = (signed char) *p++;
            while (n > 0)
            {
                val = (val << 8) | *p++;
                --n;
            }
            value->setValue (val);
        }
        else if (n < 8)
        {
            // n = 4..7: one sign-carrying byte plus n more (5..8 bytes)
            int64 val = (signed char) *p++;
            while (n > 0)
            {
                val = (val << 8) | *p++;
                --n;
            }
            value->setValue (val);
        }
        // perhaps reserve values through to n < 16
        // to allow for 128 bit integers?
        else if (n < 16)
        {
            throw no_128bit_integers_yet(n);
        }
        else
        {
            // we are left with values 16..63;
            n -= 31; // gives -15..32
            value->setValue (n);
            // we could use a different offset eg:
            // n -= 17; // gives -1..46
        }
    }
    break;
...

Whether you use a switch or the series of if/else-if statements
is unlikely to make much difference in the above integer code.
Indeed, if we are trying to conserve clock cycles then we can
unwind the short loops.
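
As a rough sketch of what unwinding could look like for the 32-bit
case: the function name and the int32_t out-parameter below are
assumptions made for illustration, and n is taken to be the number
of bytes that follow the first, sign-carrying byte (0..3).

```cpp
#include <cassert>
#include <cstdint>

typedef unsigned char UCHAR;

// Hypothetical helper: reads a big-endian signed integer of n + 1 bytes.
// The loop is replaced by a switch that falls through, so exactly n
// further bytes are consumed without a loop counter.
const UCHAR* read_int32_unrolled(unsigned n, const UCHAR* p, int32_t* out)
{
    assert(n <= 3);                      // this sketch covers 1..4 bytes only
    int32_t val = (signed char) *p++;    // first byte carries the sign
    switch (n)
    {
    case 3: val = val * 256 + *p++; // fall through
    case 2: val = val * 256 + *p++; // fall through
    case 1: val = val * 256 + *p++; // fall through
    default: break;
    }
    *out = val;
    return p;                            // points just past the integer
}
```

Multiplying by 256 is used here instead of shifting so that negative
intermediate values stay well defined on older compilers.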

But yes, any way you look at it my suggestion will probably
take another few clock cycles. This is especially true if you
follow the primary intention of my suggestion - to split out
early into separate functions dedicated to each decode, rather
than trying to have a single long function.

eg:
> switch(code >> 6)
> {
> case 1: // integer
        p = integer_decode(code & 0x3F, p, value);
        break;

The integer_decode function would accept an integer (or perhaps
an unsigned integer) as its input. How it uses that value is
up to it. Once the integer decoding is isolated from the "data
type decoding" of the main function, and vice versa, either
could be altered without impacting the other.

For example, if in stream v2 we decide that the old encoding
does not work and we are going to use two bytes, then the data
type decode function changes but integer_decode does not.
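
A minimal sketch of that isolation follows. Everything in it is
an assumption made for illustration: the Value stand-in, the names
decode_v1 and decode_v2, the two-byte v2 header layout, and the
reading of the 6-bit info as "number of bytes after the first".
Only the multi-byte integer form is handled.

```cpp
#include <cassert>
#include <cstdint>

typedef unsigned char UCHAR;

// Hypothetical stand-in for the real Value class.
struct Value
{
    int64_t number = 0;
    void setValue(int64_t v) { number = v; }
};

// Integer decoding only. 'info' is whatever the caller pulled out of
// the header; here it is assumed to be the count of bytes (0..7) that
// follow the first, sign-carrying byte.
const UCHAR* integer_decode(unsigned info, const UCHAR* p, Value* value)
{
    assert(info < 8);                    // sketch handles 1..8 bytes only
    int64_t val = (signed char) *p++;
    for (unsigned n = info; n > 0; --n)
        val = val * 256 + *p++;          // big-endian accumulate
    value->setValue(val);
    return p;
}

// Stream v1: data type in the top two bits, info in the low six.
const UCHAR* decode_v1(const UCHAR* ptr, Value* value)
{
    const UCHAR* p = ptr;
    UCHAR code = *p++;
    switch (code >> 6)
    {
    case 1: // integer
        p = integer_decode(code & 0x3F, p, value);
        break;
    default: // other data types omitted from this sketch
        break;
    }
    return p;
}

// Hypothetical stream v2: one data type byte, then one info byte.
// Only the header parsing differs; integer_decode is reused as is.
const UCHAR* decode_v2(const UCHAR* ptr, Value* value)
{
    const UCHAR* p = ptr;
    UCHAR type = *p++;
    UCHAR info = *p++;
    switch (type)
    {
    case 1: // integer
        p = integer_decode(info, p, value);
        break;
    default:
        break;
    }
    return p;
}
```

Both top-level functions stay short enough to read on one screen, and
a change to the header layout touches only them.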


This all comes at the cost of a few clock cycles. Bit twiddling
also has its own set of problems and the function isolation
that I propose will definitely incur some performance overhead.

In compact encodings (where there are very few strings, opaques
or out-of-range integers) the difference could perhaps be
significant - in relative terms. The question however should
not really be a matter of direct performance comparison - but
whether the few additional clock cycles are significant in the
overall scheme of things (when included with things like utf8
decoding and so on).

The advantages seem to me to include:
- clearer code (functions that can be read on one screen)
- flexibility (your subordinate functions can decide to use
  the additional (6-bit integer) info any way that suits
  their needs)
- stability (each decoding part is isolated; if integers
  have the same decoding in stream v2 then the code can be
  called from the v2 function)

Is it worth that cost? We can't really say, because we do not
yet know the cost. It will also depend on how much you are
willing to pay for clear and easy to read code (and this is
relative too, as some people have trouble understanding bit
manipulation code - but that code can be isolated in the top
level data-type decode function).


Included in the latitude is the ability to discard the idea of
embedding lengths for opaque data types. Just include it as a
specific data type and reuse the entire bit-range it was
occupying for future expansion. Indeed you could do something
like this with the top two bits:
10 = integer
11 = utf8
00 = other
01 = other
And now the "other" data type effectively has 7 bits to play
with.
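
The bit layout above could be classified along these lines. This is
only a sketch: the Kind enum, the Header struct and the classify
function are hypothetical names, not anything from the existing code.

```cpp
#include <cassert>

typedef unsigned char UCHAR;

enum class Kind { Integer, Utf8, Other };

// Result of looking at the first byte: the data type plus the bits
// left over for that type to interpret as it sees fit.
struct Header
{
    Kind kind;
    unsigned info;
};

Header classify(UCHAR code)
{
    if ((code & 0x80) == 0)
        return { Kind::Other, code & 0x7Fu };    // 0xxxxxxx: 7 bits of info
    if ((code & 0x40) == 0)
        return { Kind::Integer, code & 0x3Fu };  // 10xxxxxx: 6 bits of info
    return { Kind::Utf8, code & 0x3Fu };         // 11xxxxxx: 6 bits of info
}
```

Because 00 and 01 both mean "other", only the top bit needs testing
for that case, which is where the seventh bit comes from.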


> It isn't my intention to pick on Geof.

Hey, it's OK. I'm thick-skinned and certainly open to discussion
on the relative merits of different ways to solve a problem. I
have no problem at all with you picking on my suggestion; that's
why it was posted. It was not really intended to be used
verbatim, it was just a basis for discussion and refinement.

PS. I am still trying to live with one of my (fairly recent)
decisions to use a really long enumeration (also for decoding
but of a different nature). It does work, but when I go to try
and debug or extend it then I quickly get lost somewhere after
line 100 in the function (constantly looking back to see where
this or that came from). I only wish that I had considered
some early isolation when I wrote that one. One day I do
intend to go back and refactor it.

--
Geoff Worboys
Telesis Computing