firebird-support - Re: [firebird-support] Bug with character sets

Subject	Re: [firebird-support] Bug with character sets
Author	Kjell Rilbe
Post date	2009-05-20T07:58:25Z

Dimitry Sibiryakov wrote:

>>>>1. What does Milan's code know that fbclient doesn't?
>>>
>>> Length of string in characters.
>>
>>I understand that Milan derives that from 1) buffer size and 2) max
>>number of bytes per character for the encoding used. The latter is
>>deduced from a charset id. Where does he get that charset id?
>
> From SQLVAR structure, but your assumption is wrong. One can't get
> right string length from buffer size and number of bytes per character.

Yes you can. That's what Milan's code does. For char(N) columns, the
"right string length" in codepoints is N, but the buffer returned from
fbclient can contain anything from N to 4N codepoints for a char(N) utf8
column. That's not intuitive, and that's what imho needs to be addressed
somehow.

>>>>2. Where does Milan's code get that info from?
>>>
>>> From conversion from UTF8 to UTF16 (I have a feeling that Milan's
>>>procedure can have incorrect results if encounter character with 4-bytes
>>>code, though I most likely am wrong).
>>
>>Not really - that's how he does the trimming, not how he deduces the
>>correct length in codepoints.
>
> These tasks are absolutely the same.

No. Step one is to determine what length in codepoints (or bytes) to
trim to. Step two is to actually do that trimming. I asked where Milan's
code finds the info required to do step one, but you answered with how
Milan's code does step two.

You seem to think that Milan's code does both in the same step by
converting to UTF16. This is not correct. If you have a buffer like this:
N
O
<space>
<space>
<space>
<space>
<space>
<space>
for a char(2) utf8 column, then a conversion of that buffer into utf16
would result in a 16 byte string containing 8 codepoints. You still
wouldn't know N or how to determine N. Utf16 does handle codepoints that
are 4-byte in utf8. Utf16 uses surrogate pairs, so even in utf16 not all
codepoints are two bytes. Some are four bytes.

Kjell
--
--------------------------------------
Kjell Rilbe
DataDIA AB
E-post: kjell@...
Telefon: 08-761 06 55
Mobil: 0733-44 24 64