Subject Re: [firebird-support] Bug with character sets
Author Mark Rotteveel
> Corrected example:
> Buffer length of 4 bytes, UTF-8 (max 4 bytes per character): will always
> contain exactly one significant character and 0-3 spaces because FB
> allocates a buffer that will have room for exactly 4N bytes for a UTF-8
> column. So, you can reverse that calculation.
>
> You assumed that the allocated buffer is exactly large enough to fit the
> actual data, while it is really sized to be able to hold the largest
> possible data for its charset and declared charlength.
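
The "reverse calculation" described in the quote can be sketched as follows. This is only an illustration under the quoted assumptions (the buffer is exactly 4N bytes for a CHAR(N) UTF-8 column, and any padding is 0x20 at the end); the function names are hypothetical and not part of any Firebird API.

```python
def char_length_from_buffer(buffer_len: int, max_bytes_per_char: int = 4) -> int:
    """Recover the declared character length N from a buffer of N * max_bytes bytes."""
    return buffer_len // max_bytes_per_char


def significant_chars(buf: bytes, max_bytes_per_char: int = 4) -> str:
    """Decode the whole buffer, then keep only the first N characters.

    Assumes the padding spaces come after the data, so everything past
    the first N characters is padding.
    """
    n = char_length_from_buffer(len(buf), max_bytes_per_char)
    return buf.decode("utf-8")[:n]


# CHAR(1) holding 'a': one data byte plus three padding spaces.
one_ascii = significant_chars(b"\x61\x20\x20\x20")
# CHAR(1) holding the euro sign (3 bytes in UTF-8) plus one padding space.
one_euro = significant_chars("\u20ac".encode("utf-8") + b"\x20")
```

Note that this keeps trailing spaces that belong to the CHAR(N) value itself, since those are indistinguishable from real data within the first N characters.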

Looks like I will have to read up on the wire protocol, but this sounds rather wasteful for a wire protocol (or does this discussion not apply to the wire protocol?).
So you are saying that a UTF-8 encoded buffer from Firebird always takes four bytes per character, with the remaining bytes stuffed with 0x20? Is that stuffing applied per character, or at the end?

So:
a CHAR(1) containing 'a' is:
0x61 0x20 0x20 0x20
(this might be defensible if you use increments of 4 bytes as boundaries, although using 0x00 would IMHO be a 'better', more defensible choice for the stuffing character; but that is probably a legacy decision?)

So what happens with a CHAR(2) containing 'ab'?
Is that stuffed per character (which would be weird and would wreak havoc on standard character set encoders / decoders):
0x61 0x20 0x20 0x20 0x62 0x20 0x20 0x20
Or at the end (which just wastes bandwidth, especially with large texts*):
0x61 0x62 0x20 0x20 0x20 0x20 0x20 0x20

I would expect something like that to be encoded as:
0x61 0x62 or maybe as 0x61 0x62 0x00 0x00
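
To make the difference between the two candidate layouts concrete, here is a small sketch of how a standard UTF-8 decoder would treat each of them. Both byte sequences are the hypothetical ones from this message, not captured Firebird output, and the variable names are made up for illustration.

```python
# Candidate layout 1: padding after every character (hypothetical).
per_char = bytes([0x61, 0x20, 0x20, 0x20, 0x62, 0x20, 0x20, 0x20])
# Candidate layout 2: all padding at the end of the buffer (hypothetical).
at_end = bytes([0x61, 0x62, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20])

declared_length = len(at_end) // 4  # CHAR(2): 8 bytes / 4 bytes per character

# A standard decoder accepts both buffers, since 0x20 is a valid character,
# but it has no notion of per-character padding.
decoded_per_char = per_char.decode("utf-8")
decoded_at_end = at_end.decode("utf-8")

# Keeping the first N characters only works when the padding is at the end.
value_per_char = decoded_per_char[:declared_length]  # 'a ' (data mangled)
value_at_end = decoded_at_end[:declared_length]      # 'ab'
```

With padding at the end, a plain decode followed by truncation to the declared length recovers the value; with per-character padding, a custom decoder would be needed.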

* In UTF-8 data (for Western Europe), *most* characters are usually in the (lower) ASCII range and thus take only 1 byte. If I transmit 100 such characters as UTF-8, padding could add 300% overhead in bandwidth (400 bytes instead of 100 bytes).
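
The arithmetic in this footnote can be checked with a short sketch. It assumes every character is padded out to a fixed 4-byte budget, as discussed above; `padding_overhead` is a hypothetical helper written for this example.

```python
def padding_overhead(text: str, max_bytes_per_char: int = 4):
    """Return (actual bytes, padded bytes, overhead in percent) for a UTF-8 string."""
    actual = len(text.encode("utf-8"))       # bytes really needed
    padded = len(text) * max_bytes_per_char  # bytes if every character gets 4
    overhead_pct = (padded - actual) / actual * 100
    return actual, padded, overhead_pct


# 100 ASCII characters: 100 bytes of data in a 400-byte buffer.
actual, padded, overhead_pct = padding_overhead("a" * 100)
```

For text containing multi-byte characters the overhead is lower, because more of the 4-byte budget per character is actually used.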

Mark