Subject | Re: [Firebird-Architect] Re: The Wolf on Firebird 3 |
---|---|
Author | Jim Starkey |
Post date | 2005-11-17T16:46:29Z |
Olivier Mascia wrote:
>I'm a strong supporter of the idea of using utf-8 internally and at
>the storage level.
>I'm just thinking out loud about whether utf-32 for in-memory while
>utf-8 on storage could be an alternate to consider, despite the
>evident increased memory requirements.
>

That would require unpacking every string on reference, probably
requiring a string scan to compute expanded length, a memory allocation,
and, eventually, a memory deallocation. In either the current record
storage format or the new record encoding, a pre-allocated descriptor
can be set up pointing into the record, obviating the need to copy or
expand. Further, in the places that handle collations -- index key
generation, sort key generation, and collation-based comparison -- there
is never a need to instantiate a full Unicode string. I submit that the
difference in the mappings of UTF-8 and 32-bit Unicode into either key
bytes or collation compare logic is insignificant in code size and
performance. The following code implements the translation of a proper
UTF-8 byte sequence:
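
    UCHAR c = *utf8++;               // lead byte
    uint code = utf8Values [c];      // presumably the payload bits of the lead byte
    uint length = utf8Lengths [c];   // presumably the sequence length it implies

    if (length > 1 && (*utf8 & 0xC0) == 0x80)
        for (; length > 1; --length)
            code = (code << 6) | (*utf8++ & 0x3f);   // 6 bits per continuation byte
    else
        code = c;                    // ASCII, or invalid byte kept as-is
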
[Note: the code intentionally treats an invalid UTF-8 byte as a lost
8859-1 character. Whether this is a good or a bad idea, it does allow
automatic in-place conversion of databases created using ISO 8859-1
into Unicode without rebuilding.]
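
For anyone who wants to experiment with the fragment above, here is a
self-contained sketch in C. It is an illustration only, not the actual
engine code. The helpers utf8_length and utf8_value are hypothetical
stand-ins for the utf8Lengths and utf8Values tables, whose contents are not
shown above. Unlike the fragment, the sketch checks every continuation byte
rather than just the first, so it never reads past a terminating NUL. The
utf8_compare function uses plain code point order as a placeholder for a
real collation, simply to show that a comparison can decode on the fly
without ever building an expanded string.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the utf8Lengths table: the sequence length
   implied by a lead byte. Anything that is not a valid lead byte gets 1. */
static unsigned utf8_length(unsigned char c)
{
    if (c < 0x80) return 1;             /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 1;                           /* stray continuation or invalid lead */
}

/* Hypothetical stand-in for the utf8Values table: the payload bits carried
   by the lead byte of a sequence of the given length. */
static uint32_t utf8_value(unsigned char c, unsigned length)
{
    static const unsigned char mask[] = { 0, 0xFF, 0x1F, 0x0F, 0x07 };
    return (uint32_t)(c & mask[length]);
}

/* Decode one code point at *p and advance *p past it. A byte that does not
   start a well-formed sequence is returned unchanged, i.e. it is treated as
   an ISO 8859-1 character, and only that one byte is consumed. */
static uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    unsigned char c = s[0];
    unsigned length = utf8_length(c);
    uint32_t code = utf8_value(c, length);
    unsigned i;

    for (i = 1; i < length; ++i) {
        if ((s[i] & 0xC0) != 0x80) {    /* missing continuation byte */
            *p = s + 1;
            return c;
        }
        code = (code << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *p = s + length;
    return code;
}

/* Compare two NUL-terminated strings code point by code point. Nothing is
   copied or allocated; both pointers simply walk the original bytes. */
static int utf8_compare(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = utf8_next(&pa);
        uint32_t cb = utf8_next(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}
```

With this sketch, the UTF-8 string "caf\xC3\xA9" and the ISO 8859-1 string
"caf\xE9" compare equal, since both decode to the same code points. That is
the in-place conversion behaviour described in the note above.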
--
Jim Starkey
Netfrastructure, Inc.
978 526-1376