| Subject | Re: [firebird-support] Bug with character sets |
| --- | --- |
| Author | Kjell Rilbe |
| Post date | 2009-05-20T07:58:25Z |
Dimitry Sibiryakov wrote:
"right string length" in codepoints is N, but the buffer returned from
fbclient can contain anything from N to 4N codepoints for a char(N) utf8
column. That's not intuitive, and that's what imho needs to be addressed
somehow.
trim to. Step two is to actually do that trimming. I asked where Milan's
code finds the info required to do step one, but you answered with how
Milan's code does step two.
You seem to think that Milan's code does both in the same step by
converting to UTF16. This is not correct. If you have a buffer like this:
N
O
<space>
<space>
<space>
<space>
<space>
<space>
for a char(2) utf8 column, then a conversion of that buffer into utf16
would result in a 16 byte string containing 8 codepoints. You still
wouldn't know N or how to determine N. Utf16 does handle codepoints that
are 4-byte in utf8. Utf16 uses surrogate pairs, so even in utf16 not all
codepoints are two bytes. Some are four bytes.
Kjell
--
--------------------------------------
Kjell Rilbe
DataDIA AB
E-post: kjell@...
Telefon: 08-761 06 55
Mobil: 0733-44 24 64
>>>> 1. What does Milan's code know that fbclient doesn't?
>>>
>>> Length of string in characters.
>>
>>I understand that Milan derives that from 1) buffer size and 2) max
>>number of bytes per character for the encoding used. The latter is
>>deduced from a charset id. Where does he get that charset id?
>
> From SQLVAR structure, but your assumption is wrong. One can't get
> right string length from buffer size and number of bytes per character.
"right string length" in codepoints is N, but the buffer returned from
fbclient can contain anything from N to 4N codepoints for a char(N) utf8
column. That's not intuitive, and that's what imho needs to be addressed
somehow.
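To make that N-to-4N point concrete, here is a small sketch of my own (not Milan's code, and not anything from the fbclient API; utf8_codepoints is just an illustrative helper). It counts codepoints in the space-padded buffer fbclient hands back for a char(2) utf8 column:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Count codepoints in a UTF-8 buffer by counting every byte that is
// not a continuation byte (10xxxxxx).
std::size_t utf8_codepoints(const std::string& buf)
{
    std::size_t count = 0;
    for (unsigned char c : buf)
        if ((c & 0xC0) != 0x80)   // lead byte or ASCII, i.e. a new codepoint
            ++count;
    return count;
}

int main()
{
    // char(2) utf8: fbclient returns a 2 * 4 = 8 byte buffer, space padded.
    // Two 1-byte codepoints ("NO") leave six padding spaces:
    std::string ascii_case = "NO" + std::string(6, ' ');          // 8 bytes
    // Two 4-byte codepoints (U+10348 twice) fill the buffer completely:
    std::string smp_case = "\xF0\x90\x8D\x88\xF0\x90\x8D\x88";    // 8 bytes

    std::cout << utf8_codepoints(ascii_case) << "\n";   // 8 codepoints = 4N
    std::cout << utf8_codepoints(smp_case)   << "\n";   // 2 codepoints = N
}
```

Same 8-byte buffer both times, but the codepoint count varies from N to 4N; the declared length N = 8 / 4 = 2 only falls out when you divide the buffer size by the charset's maximum bytes per character.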
>>>> 2. Where does Milan's code get that info from?
>>>
>>> From conversion from UTF8 to UTF16 (I have a feeling that Milan's
>>>procedure can have incorrect results if encounter character with 4-bytes
>>>code, though I most likely am wrong).
>>
>>Not really - that's how he does the trimming, not how he deduces the
>>correct length in codepoints.
>
> These tasks are absolutely the same.
No. Step one is to determine what length in codepoints (or bytes) to
trim to. Step two is to actually do that trimming. I asked where Milan's
code finds the info required to do step one, but you answered with how
Milan's code does step two.
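To spell the two steps out, here is roughly the kind of thing I mean, sketched in C++. This is not Milan's actual code; I'm assuming the byte length comes from the XSQLVAR's sqllen and the charset id from the low byte of sqlsubtype (as I understand that structure), and the bytes-per-character lookup is deliberately simplified:

```cpp
#include <cstddef>
#include <string>

// Step one: determine the length to trim to, in codepoints.
// For a char(N) column, sqllen is N times the charset's maximum bytes
// per character; for utf8 that maximum is 4.
std::size_t declared_codepoints(short sqllen, short sqlsubtype)
{
    int charset_id = sqlsubtype & 0xFF;          // low byte, as I understand it
    int max_bytes  = (charset_id == 4) ? 4 : 1;  // simplified lookup, utf8 only
    return static_cast<std::size_t>(sqllen) / max_bytes;
}

// Step two: do the trimming, by walking the first n codepoints of the
// (assumed well-formed) UTF-8 buffer and cutting there.
std::string trim_to_codepoints(const char* buf, std::size_t buf_len, std::size_t n)
{
    std::size_t bytes = 0;
    for (std::size_t cp = 0; cp < n && bytes < buf_len; ++cp)
    {
        unsigned char lead = static_cast<unsigned char>(buf[bytes]);
        bytes += (lead < 0x80) ? 1    // ASCII
               : (lead < 0xE0) ? 2    // 2-byte sequence
               : (lead < 0xF0) ? 3    // 3-byte sequence
               : 4;                   // 4-byte sequence
    }
    return std::string(buf, bytes);
}
```

The conversion, whether to utf16 or anything else, only ever performs step two; the N it trims to has to come from somewhere else first.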
You seem to think that Milan's code does both in the same step by
converting to UTF16. This is not correct. If you have a buffer like this:
N
O
<space>
<space>
<space>
<space>
<space>
<space>
for a char(2) utf8 column, then a conversion of that buffer into utf16
would result in a 16-byte string containing 8 codepoints. You still
wouldn't know N or how to determine N. Utf16 does handle codepoints that
take 4 bytes in utf8, but it does so with surrogate pairs, so even in
utf16 not all codepoints are two bytes. Some are four bytes.
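For the surrogate pair point, here is a quick sketch of my own (again not Milan's code) of counting codepoints in utf16, where a pair of 16-bit units makes up one codepoint:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Count codepoints in a UTF-16 string. A high surrogate (0xD800-0xDBFF)
// starts a two-unit pair, so the pair counts as a single codepoint.
std::size_t utf16_codepoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        ++count;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            ++i;                                  // skip the low surrogate
    }
    return count;
}

int main()
{
    // "NO" plus six padding spaces, i.e. the char(2) buffer above after
    // conversion to utf16: 16 bytes, 8 codepoints, N still unknown.
    std::u16string padded = u"NO";
    padded.append(6, u' ');
    std::cout << utf16_codepoints(padded) << "\n";   // 8

    // U+10348 needs a surrogate pair even in utf16: four bytes, one codepoint.
    std::u16string smp = u"\U00010348";
    std::cout << utf16_codepoints(smp) << "\n";      // 1
}
```

So counting 16-bit units after the conversion is no safer than counting bytes in utf8, and in neither representation does the count tell you N.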
Kjell
--
--------------------------------------
Kjell Rilbe
DataDIA AB
E-post: kjell@...
Telefon: 08-761 06 55
Mobil: 0733-44 24 64