Subject Re: [firebird-support] Bug with character sets
Author Brad Pepers
On 19-May-09, at 7:10 AM, Kjell Rilbe wrote:

> Martijn Tonies wrote:
>>>>> But how does the -client- know what bytes are significant and what
>>>>> bytes aren't?
>>>>
>>>> You are right - the client has no idea about the length of the data
>>>> in characters; that's why it just fills the whole buffer with spaces.
>>>
>>> The client has to determine the number of characters by dividing the
>>> buffer size by "max bytes per character". For char(2) data in utf8,
>>> the buffer size will be 8 bytes, and max bytes per character is 4,
>>> so the data should be trimmed to 2 characters (codepoints).
>>>
>>> That's what Milan said.
>>
>> We've had a 1-char example: Milan explained "4 bytes / 4 bytes per
>> char", so only 1 char.
>>
>> But my question hasn't been answered:
>>
>> How is this buffer filled and how do you know what bytes are
>> significant and what bytes aren't?
>
> Well, I don't really know anything about the C-level interface, but as
> far as I've understood things, the struct that describes a data field
> in a result set contains a buffer size. Apparently, the buffer is
> always dimensioned to be able to hold N chars for a (var)char(N)
> column, and in UTF8 that means a buffer of 4N bytes.
>
> Also, as far as I understand, the client knows what character encoding
> is used, and from that (or directly?) can determine the maximum number
> of bytes per codepoint for the encoding used. For UTF8 it's 4 bytes.
> For ANSI it's 1 byte.
>
> So, the client can find out the buffer size and the max number of bytes
> per codepoint. Divide these two to get the field size in codepoints,
> i.e. the value of N for a (var)char(N) field.
>
> Then, trim the data in the buffer to N codepoints, which may be anything
> from 2 to 8 bytes for a UTF8 char(2) column. It's up to the client to
> parse the UTF8 string to find out how many bytes are used by the first
> two codepoints in the buffer. Firebird *could* pass zeros or
> "uninitialized data" in the remaining part of the buffer, but they have
> chosen to pad with spaces.
>
> Martijn, you know what bytes are significant by determining N from the
> buffer size divided by "max bytes per character". N is not the number
> of bytes that are significant, but the number of codepoints. You have
> to parse the buffer to find out how many bytes you need to read to get
> N codepoints.

I think I see what I'll need to do, though having to cache the max byte
size of all the possible character sets is pretty ugly.
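
Something like this is roughly what I'm picturing (just a sketch against
the buffer Firebird gives me for a CHAR(N) UTF8 column; the
max-bytes-per-charset value would come out of whatever cache I end up
building, so that part is hypothetical):

#include <stddef.h>

/* Sketch: given the CHAR(N) buffer Firebird returns (sqllen bytes,
 * space-padded) and the max bytes per character of the column's
 * charset, work out how many bytes the first N codepoints occupy.
 * Assumes the server hands back valid UTF8. */
static size_t significant_bytes(const char *buf, size_t sqllen,
                                size_t max_bytes_per_char)
{
    size_t n = sqllen / max_bytes_per_char;  /* declared length N */
    size_t pos = 0;

    /* walk forward one codepoint at a time; the lead byte tells us
     * how long each sequence is */
    while (n-- > 0 && pos < sqllen) {
        unsigned char lead = (unsigned char)buf[pos];
        if      (lead < 0x80) pos += 1;  /* ASCII            */
        else if (lead < 0xE0) pos += 2;  /* 2-byte sequence  */
        else if (lead < 0xF0) pos += 3;  /* 3-byte sequence  */
        else                  pos += 4;  /* 4-byte sequence  */
    }
    return pos;  /* everything past this is just the space padding */
}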

I wonder why I don't see this problem at all with varchar columns
though. If I change the char(1) to varchar(1), everything works
properly and I don't seem to have to do any of this. Is this expected?
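
My guess, if I'm reading ibase.h right, is that varchar comes back as
SQL_VARYING, where the buffer starts with a 2-byte byte length, so none
of the codepoint arithmetic is needed - something like:

#include <stddef.h>
#include "ibase.h"  /* PARAMVARY / XSQLVAR */

/* For a varchar column the sqldata buffer (SQL_VARYING) carries its own
 * length, so the client can read exactly the significant bytes. */
static size_t varying_bytes(const XSQLVAR *var)
{
    const PARAMVARY *v = (const PARAMVARY *)var->sqldata;
    return (size_t)v->vary_length;  /* actual string length in bytes */
}

But maybe someone can confirm that.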

Also, the problem seems to go away if I change the character set of my
connection to ISO8859_1, even though the character set used for the
database is still UTF8.

--
Brad