Subject | Re: [Firebird-Architect] Re: The Wolf on Firebird 3 |
---|---|
Author | Jim Starkey |
Post date | 2005-11-17T03:01:33Z |
Adriano dos Santos Fernandes wrote:
>AFAIK VARCHAR variable isn't truncated to the used length before
>compression. Why?
>
>
Because records are stored in fixed formats. The unused space in
varchars is zapped to binary zeros so compression will eliminate it.
As I have said, I would like to switch to the new record encoding for
records on disk. The encoding distills all types to significant bits,
eliminating common integers entirely and eliminating trailing zeros /
-1s in binary numbers, etc. By measurement on production databases, the
effective size is about half of the current run length encoding form.
It also eliminates any performance hit from over-specifying varchars, an
unavoidable practice. The downside is that there isn't any good way to
handle multiple character sets.
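To make "distills all types to significant bits" concrete before getting to
character sets, here is a rough sketch of that style of integer encoding. It
is hypothetical code, not the encoding in the engine: the value is written
low byte first with a one-byte length tag, and bytes that are pure sign
extension carry no information, so zero costs a single byte and small values
cost two.

    #include <cstdint>
    #include <vector>

    // Sketch only: append a 64-bit integer as a length tag plus its
    // significant bytes, low byte first.  Sign-extension bytes (0x00 for
    // non-negative values, 0xFF for negative ones) are not written.
    void encodeInt(std::vector<uint8_t>& out, int64_t value)
    {
        uint8_t bytes[9];
        int n = 0;
        int64_t v = value;

        // Peel off low-order bytes until only sign extension remains.
        while (v != 0 && v != -1)
        {
            bytes[n++] = static_cast<uint8_t>(v & 0xFF);
            v >>= 8;
        }

        // If the high bit of the last byte disagrees with the sign, one
        // explicit sign byte is still needed (so 255 decodes differently
        // from -1).
        const bool negative = value < 0;
        if (value != 0 && (n == 0 || ((bytes[n - 1] & 0x80) != 0) != negative))
            bytes[n++] = negative ? 0xFF : 0x00;

        out.push_back(static_cast<uint8_t>(n));   // length tag; 0 means "the value 0"
        for (int i = 0; i < n; i++)
            out.push_back(bytes[i]);
    }

A real record encoding would fold the common small values into the type byte
itself, which is what makes the most frequent integers vanish entirely.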
As I see it, there are five alternatives for character set handling:
1. Multiple character sets (current implementation)
2. 32 bit Unicode
3. 16 bit Unicode
4. UTF-16
5. UTF-8
It is clear to me -- and I hope it will be equally clear to others if
you study the code -- that the current implementation is complex, with
character set handling code distributed in well over a dozen places. It
bloats the code, is bug prone (any character copy that doesn't handle
conversion is a bug), and is a big performance hit even when used
without conversion, due to the code to set up, propagate, and test
descriptors when copying or comparing strings.
32 bit Unicode is intellectually defensible, but a terrible waste of
storage. Why waste 4 bytes when probably 95% of the characters stored
in Firebird worldwide can be represented in a single byte?
16 bit Unicode is well accepted on Windows and in Java, but can't handle
some Asiatic character sets. Even if these character sets are
relatively unimportant, does it make sense to define them out of
architectural existence?
UTF-16 has all the problems of UTF-8 regarding variable length
characters and the need, on occasion, to expand to full 16 bit (now) or
32 bit (future) Unicode when computing keys and collated comparisons.
It also has endian problems (boo) and requires more bytes to represent
almost any string than UTF-8.
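For reference, the per-code-point costs behind those size arguments are
just the standard Unicode rules, nothing Firebird specific:

    #include <cstddef>
    #include <cstdint>

    // Bytes needed to store one code point in each encoding.
    std::size_t utf8Bytes(uint32_t cp)
    {
        if (cp < 0x80)    return 1;   // ASCII
        if (cp < 0x800)   return 2;   // most Latin supplements, Greek, Cyrillic, Hebrew, Arabic
        if (cp < 0x10000) return 3;   // rest of the BMP, including CJK
        return 4;                     // supplementary planes
    }

    std::size_t utf16Bytes(uint32_t cp) { return cp < 0x10000 ? 2 : 4; }

    std::size_t utf32Bytes(uint32_t)    { return 4; }

So ASCII-dominated data costs one byte per character in UTF-8 against two
in UTF-16 and four in 32 bit Unicode; the balance only tips the other way
for text dominated by BMP characters above U+07FF.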
Which leaves us UTF-8. Multi-byte character sequences are a pain, but
only in a tiny number of character handling functions. By far the two
most common string operations are copy and equality comparison, neither
of which is sensitive to character boundaries.
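The reason copy and equality don't care is that two well-formed UTF-8
strings encode the same sequence of code points exactly when their bytes
are equal, so both operations can stay byte oriented. A trivial
illustration (my names, not engine functions):

    #include <cstddef>
    #include <cstring>
    #include <string>

    // Exact-match comparison never needs to find character boundaries.
    bool utf8Equal(const std::string& a, const std::string& b)
    {
        return a.size() == b.size()
            && std::memcmp(a.data(), b.data(), a.size()) == 0;
    }

    // A copy is a plain byte copy: no descriptors, no conversion.
    void utf8Copy(char* dest, const char* src, std::size_t byteLength)
    {
        std::memcpy(dest, src, byteLength);
    }

Collated comparison is the exception, but as noted below it already
requires conversion today.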
As Roman said, the desirability of density is not to save disk space --
disks are too cheap to worry about -- but for performance. At the time
I was writing Rdb/ELN (JRD-I), I was working for DEC's disk engineering
group. I can assure you that people who sell disks don't have a natural
liking for compression. I had to explain that the purpose of
compression was not to save disk space -- which was a regrettable side
effect -- but to reduce the number of disk transfers for performance.
>>Not
>>an exact one of course, because such a utf8-ization of the internals
>>and storage would certainly receive a great deal of attention to
>>architecture and implementation details. (I have fear that the
>>current UNICODE_FSS implementation uses 3 bytes for each char,
>>needed or not. Also when defining columns, the length you have to
>>give is a kind of byte count, so you have to declare your size * 3,
>>if I remember well. That is obviously not how it should work. That's
>>why I fear the comparison would be probably unfair based on FB1 or
>>FB2. But again that may be an indicator. )
>>
>>
I would like to eliminate the concept of fixed length string from the
client API to the engine guts. For SQL compliance, we need to check the
number of characters, not bytes, during an assignment made to what I
hope will be considered legacy field declarations. If we have a
platform independent message format, there will no longer be any reason
to allocate string temps of fixed length (or any length), so a whole
host of problems simply vanish leaving nothing where ugly code used to
reside.
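A sketch of what that character-count check could look like on assignment
to a legacy CHAR(n) / VARCHAR(n) declaration -- assumed names, not the
actual engine code:

    #include <cstddef>

    // Count characters in a UTF-8 buffer: every byte except the
    // 10xxxxxx continuation bytes starts a character.
    std::size_t utf8CharCount(const unsigned char* s, std::size_t byteLength)
    {
        std::size_t chars = 0;
        for (std::size_t i = 0; i < byteLength; i++)
            if ((s[i] & 0xC0) != 0x80)
                chars++;
        return chars;
    }

    // SQL-style overflow check: the declared limit is in characters,
    // not bytes, so a 10-character Cyrillic string still fits a
    // VARCHAR(10) even though it occupies 20 bytes in UTF-8.
    bool fitsDeclaredLength(const unsigned char* value, std::size_t byteLength,
                            std::size_t declaredChars)
    {
        return utf8CharCount(value, byteLength) <= declaredChars;
    }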
>He's got a pretty good track record and a strong set of arguments.
>
>>That's what I'm afraid too. But if the whole engine is utf8-ized, then
>>there is no return back - it will cause many changes to the engine
>>internals that most likely will not be possible to rollback (even if
>>we ignore all the efforts were put into it). So, for now we have only
>>Jim words that everything going to be fine...
>>
>>
>>That's just wrong. Conversions become strictly client side issues
>>
>>
>>
>I'm afraid that it will be a high cost for many users (Windows, for
>example).
>Conversions will be needed when retrieving, sending, sorting and
>comparing strings.
>
>
>
>
Conversions become strictly client side issues (legacy UDFs aside).
Collation expansion for keys and collated compares require conversion
now, so there is no change.