firebird-support - Re: [firebird-support] Re: Using unicode versus WIN1252 (Firebird

Subject	Re: [firebird-support] Re: Using unicode versus WIN1252 (Firebird
Author	Dmitry Yemanov
Post date	2009-01-11T17:24:01Z

Douglas Tosi wrote:

>
> If I understood correctly:
>
> On disk: UTF8 does not use variable-byte characters. Every character
> is 4 bytes long but the whole string is RLE-compressed.
> Over-the-wire: Same as on-disk, except there is no RLE-compression.
> Network overhead for UTF8 strings is, thus, the same as for Unicode
> strings.
> On the client: fbclient strips the trailing zeros of characters
> smaller than 4 bytes and returns a proper UTF8 string to the
> application.
>
> Is that it?

Nope, this is wrong :-) Sorry for explaining things badly. Let's try
once more.

There are no wide (i.e. always multi-byte) Unicode characters in
Firebird, period. Everything is always handled in UTF8 which uses
variable-byte characters. However, every UTF8 field of CHAR(N) always
occupies 4*N bytes in memory, because the [unpacked] record has a fixed
format (length) suitable for storing the maximum possible values (i.e. N
characters of 4 bytes each). If the actual value has fewer than N
characters or if some of its characters use fewer than 4 bytes, then the
field is going to contain some trailing zero bytes not used by any
character, i.e. wasted for nothing.
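
To make the layout concrete, here is a minimal Python sketch of the
unpacked CHAR(N) slot as described above. It only models the byte layout
from this explanation; the function name `unpacked_char_field` is
hypothetical and is not part of Firebird or any driver.

```python
def unpacked_char_field(value: str, n: int) -> bytes:
    """Model a CHAR(n) UTF8 field: a fixed 4*n-byte slot holding the
    UTF-8 bytes first, with zero bytes filling the unused tail
    (an illustration of the description above, not Firebird code)."""
    if len(value) > n:
        raise ValueError("value has more than n characters")
    encoded = value.encode("utf-8")
    return encoded.ljust(4 * n, b"\x00")

text = "h\u00e9llo"                  # "héllo": 5 characters, one of them 2 bytes
field = unpacked_char_field(text, 10)

print(len(text.encode("utf-8")))     # 6  (actual UTF-8 length)
print(len(field))                    # 40 (4 * N slot)
print(field.count(b"\x00"))          # 34 (unused trailing zero bytes)
```

The first 6 bytes are an ordinary variable-length UTF-8 string; the
remaining 34 bytes are the padding that exists only because the slot is
sized for the worst case.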

On disk, this redundant (i.e. longer than the actual UTF8 encoding
needs) string is RLE-compressed; basically, this removes the unused
trailing zeros, bringing it to almost the "canonical" length.
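
As a rough illustration of why run-length encoding takes care of the
padding, the sketch below collapses consecutive equal bytes into
(count, byte) pairs. This is a generic, simplified RLE written for this
example (the helper `rle_runs` is hypothetical); it is not Firebird's
actual on-disk compression format.

```python
from itertools import groupby

def rle_runs(data: bytes) -> list:
    """Collapse consecutive equal bytes into (count, byte_value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(data)]

# "héllo" in a CHAR(10) slot: 6 UTF-8 bytes followed by 34 zero bytes.
field = "h\u00e9llo".encode("utf-8").ljust(40, b"\x00")
runs = rle_runs(field)

print(len(field))   # 40
print(len(runs))    # 6
print(runs[-1])     # (34, 0)
```

All 34 trailing zeros end up in a single run, so the stored size is
driven by the real characters rather than by the 4*N slot.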

Over the wire, this UTF8 string is sent "as is" for CHARs and without
trailing zeros for VARCHARs (they're stripped at the remote protocol level).
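
The difference between the two cases can be sketched as follows. This
only models the payload sizes described above; `wire_payload` is a
hypothetical helper, and details of the real remote protocol (length
prefixes, packet framing) are deliberately left out.

```python
def wire_payload(field: bytes, is_varchar: bool) -> bytes:
    """Return the bytes sent for a field, per the description above:
    CHAR goes out as is, VARCHAR goes out without the trailing zeros."""
    return field.rstrip(b"\x00") if is_varchar else field

# "héllo" in a 10-character UTF8 field: 6 UTF-8 bytes plus 34 zero bytes.
field = "h\u00e9llo".encode("utf-8").ljust(40, b"\x00")

print(len(wire_payload(field, is_varchar=False)))  # 40
print(len(wire_payload(field, is_varchar=True)))   # 6
```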

On the client, your application allocates 4*N bytes in the result buffer
(this length is reported in XSQLVAR::sqllen). The fetched string is
stored in this buffer. I.e. again, this is a "thin" UTF8 string with
some trailing zeros.

In other words, a proper UTF8 string is always being stored and
transferred, but often it has some extra zero bytes at the tail (enough
to fill the 4*N buffer length).

Hopefully, I've explained it better this time :-)

Dmitry