| Subject | Re: [Firebird-Architect] Data Stream Encoding |
| --- | --- |
| Author | Jim Starkey |
| Post date | 2005-05-04T21:31:55Z |
Dmitry Yemanov wrote:
>> Your alternative, strikingly similar to the PDP-11 instruction word,
>> incidentally, requires masking, shifting, and a fair bit of navigation to
>> get to the specific case.
>
> I wouldn't buy this as an argument. Masking or shifting requires a single
> cycle even on very old CPUs. Considering the existence of other code in the
> encode/decode routines, I bet we'll see no noticeable difference between
> your code and Geoff's.

Let's temporarily ignore the fact that Geof's encoding can be defined in my
scheme by twiddling parameters, but we'll come back to that. Let's look
at performance.
In my scheme, the code to handle an integer is:
const UCHAR* EncodedDataStream::decode(const UCHAR *ptr, Value *value)
{
    const UCHAR *p = ptr;
    UCHAR code = *p++;

    switch (code)
        {
        case edsIntMinus10:
        ...
        case edsInt31:
            value->setValue((short) (code - edsInt0));
            break;

        case edsIntLen1:
        ...
        case edsIntLen4:
            {
            int l = code - edsIntLen1;
            int32 val = (signed char) *p++;

            for (int n = 0; n < l; ++n)
                val = (val << 8) | *p++;

            value->setValue(val);
            }
            break;

        case edsIntLen5:
        case edsIntLen6:
        case edsIntLen7:
            {
            int l = code - edsIntLen1;
            int64 val = (signed char) *p++;

            for (int n = 0; n < l; ++n)
                val = (val << 8) | *p++;

            value->setValue(val);
            }
            break;
        }

    return p;
}
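To make the symmetry concrete, here is roughly what the matching encoder looks like. This is a sketch only: the edsInt0 and edsIntLen1 values below are placeholders, not the actual code-point assignments, and it covers just the integer class.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Placeholder code-point values; the real assignments differ.
enum {
    edsInt0    = 0x20,  // literals edsIntMinus10 .. edsInt31 surround this
    edsIntLen1 = 0x50   // edsIntLen1 .. edsIntLen8 follow consecutively
};

// Encode a 64-bit integer: one code byte for the literal range -10..31,
// otherwise a length-tagged code followed by the value, big-endian,
// sign byte first.  Assumes arithmetic right shift for negatives, which
// holds on common compilers.
static void encodeInt(int64_t value, std::vector<uint8_t>& out)
{
    if (value >= -10 && value <= 31)
        {
        out.push_back((uint8_t)(edsInt0 + value));
        return;
        }

    // Find the smallest byte count that round-trips with sign extension.
    int count = 1;

    while (count < 8)
        {
        int64_t shifted = value >> (count * 8 - 1);
        if (shifted == 0 || shifted == -1)
            break;
        ++count;
        }

    out.push_back((uint8_t)(edsIntLen1 + count - 1));

    for (int n = count - 1; n >= 0; --n)
        out.push_back((uint8_t)(value >> (n * 8)));
}
```

The decoder's switch falls out of the code byte alone, which is the whole point: the encoder front-loads all the length logic.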
In Geof's scheme, the equivalent code is:
const UCHAR* EncodedDataStream::altDecode(const UCHAR *ptr, Value *value)
{
    const UCHAR *p = ptr;
    UCHAR code = *p++;

    switch (code >> 6)
        {
        case 1: // integer
            {
            short n = (short) (code << 10) >> 10; // sign extend 6-bit field

            if (n == -32)
                {
                int length = *p++;

                if (length <= 4)
                    {
                    int32 val = (signed char) *p++;

                    for (int n = 0; n < length; ++n)
                        val = (val << 8) | *p++;

                    value->setValue(val);
                    }
                else
                    {
                    int64 val = (signed char) *p++;

                    for (int n = 0; n < length; ++n)
                        val = (val << 8) | *p++;

                    value->setValue(val);
                    }
                }
            else
                value->setValue(n);
            }
            break;
        }

    return p;
}
Mine requires a simple switch. Geof's requires a switch on a shifted
value, two more shifts to extract a sign-extended field, at least one
comparison, and an optional length fetch and secondary comparison to
avoid unnecessary 64-bit arithmetic on 32-bit machines. Also note that
integers outside the "literal" range require an extra byte. Geof could
improve his scheme by using a few values of the literal range to hold the
integer length.
> Could you provide a real data example where your idea shows much more
> dense encoding?

For integers in the literal range, they're equal. For all others, my
scheme stores integers in one less byte.
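A back-of-the-envelope size calculator makes the claim checkable. This is a sketch: geofSize models the altDecode path above (code byte plus explicit length byte outside the 6-bit literal range), mySize folds the length into the code byte, and both helper names are mine, not part of either scheme.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for the value itself, big-endian with a sign byte.
// Assumes arithmetic right shift for negatives.
static int valueBytes(int64_t v)
{
    int count = 1;

    while (count < 8)
        {
        int64_t shifted = v >> (count * 8 - 1);
        if (shifted == 0 || shifted == -1)
            break;
        ++count;
        }

    return count;
}

// My scheme: literal -10..31 is one byte; otherwise the code byte
// carries the length, so the total is 1 + valueBytes.
static int mySize(int64_t v)
{
    return (v >= -10 && v <= 31) ? 1 : 1 + valueBytes(v);
}

// Geof's scheme: the 6-bit field holds literals -31..31 (-32 is the
// escape); otherwise code byte + explicit length byte + value bytes.
static int geofSize(int64_t v)
{
    return (v >= -31 && v <= 31) ? 1 : 2 + valueBytes(v);
}
```

Running it on, say, 300 gives 3 bytes for my scheme against 4 for Geof's, and the one-byte gap persists for every wider integer.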
The difference for long utf and opaque strings is even more pronounced.
In my scheme, the length of the length field is embedded in the code.
In Geof's, either a variable-length binary count is required, which is
slow to deconstruct, or another length byte is required.
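To see why variable-length binary is slow to deconstruct, compare a typical continuation-bit decoder with the fixed-count fetch my codes permit. A sketch: neither scheme actually specifies this varint format, it simply stands in for "variable length binary".

```cpp
#include <cassert>
#include <cstdint>

// Variable-length ("varint") count: each byte holds 7 bits plus a
// continuation flag, so the decoder must test every byte before it
// knows whether another follows.
static const uint8_t* readVarint(const uint8_t* p, uint64_t* out)
{
    uint64_t value = 0;
    int shift = 0;
    uint8_t b;

    do  {
        b = *p++;
        value |= (uint64_t)(b & 0x7F) << shift;
        shift += 7;
        } while (b & 0x80);

    *out = value;
    return p;
}

// Length-of-length in the code byte: the byte count is known up front,
// so the loop is a straight fixed-width fetch with no per-byte test.
static const uint8_t* readCounted(const uint8_t* p, int count, uint64_t* out)
{
    uint64_t value = 0;

    for (int n = 0; n < count; ++n)
        value = (value << 8) | *p++;

    *out = value;
    return p;
}
```

Both read the same number of bytes for a length like 300; the difference is the data-dependent branch per byte in the varint path.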
In Geof's scheme, 192 of the 256 codes are required for three classes:
integer, opaque, and utf-8. That leaves only a quarter of the code space
for dates, timestamps, times, doubles, null, and scaled integers. Most of
these require lengths for efficient coding. They just don't fit in
Geof's scheme.
I'm currently using less than 140 code points to cover these types.
There's a hard tradeoff between literal integers, embedded utf lengths,
and embedded opaque lengths. What is the correct distribution of code
space among these? No theory will tell you. The way to answer the
question is to analyse a bunch of databases to determine the points of
diminishing returns. With my scheme, we can tune the code space
allocated to each class. In Geof's scheme, we're stuck with 6 bits or
no bits.
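The measurement itself is trivial: sweep candidate literal ranges over representative column data and watch where coverage flattens out. A sketch, with made-up sample values standing in for a real database scan:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Count how many sample values a candidate literal range would cover
// in a single byte.  Widening the range past the knee of this curve
// wastes code points better spent on string-length codes.
static int coveredByLiterals(const std::vector<int64_t>& samples,
                             int64_t lo, int64_t hi)
{
    int covered = 0;

    for (int64_t v : samples)
        if (v >= lo && v <= hi)
            ++covered;

    return covered;
}
```

Run it for -10..31, -32..63, and so on; the range where the count stops climbing is the point of diminishing returns.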
It isn't my intention to pick on Geof. The way to analyse something is
to compare it to alternatives, which requires a straw man. But I'm
reasonably sure that if you think about it, you will see that my scheme is
denser and much, much faster to decode. So other than being smaller,
faster, and more flexible, what's wrong with it?
>> Thanks for your comments and ideas. I think developers can handle
>> switch statements with large blocks of labels without that much
>> difficulty, but you may be right.
>
> Ugly and too complex code is often an indication of bad design ideas.
>
> Dmitry
Jim Starkey
Netfrastructure, Inc.
978 526-1376