Subject: Re: [firebird-support] Bug with character sets
Author: Kjell Rilbe
Martijn Tonies wrote:
> >>> But how does the -client- know what bytes are significant and what
> >>> bytes aren't?
> >>
> >> You are right - client has no idea about length of data in
> >> characters, that's why it just fill whole buffer with spaces.
> >
> > The client has to determine the number of characters by dividing the
> > buffer size by "max bytes per character". For char(2) data in utf8, the
> > buffer size will be 8 bytes, and max bytes per character is 4, so the
> > data should be trimmed to 2 characters (codepoints).
> >
> > That's what Milan said.
>
> We've had a 1-char example, Milan explained "4 bytes / 4 bytes per
> char", so only 1 char.
>
> But my question hasn't been answered:
>
> How is this buffer filled and how do you know what bytes are significant
> and what bytes aren't?

Well, I don't really know the C-level interface in any detail, but as
far as I've understood things, the struct that describes a data field in
a result set contains a buffer size. Apparently, the buffer is always
dimensioned to be able to hold N chars for a (var)char(N) column, and in
UTF8 that means a buffer of 4N bytes.

Also, as far as I understand, the client knows what character encoding
is used, and from that (or directly?) it can determine the maximum
number of bytes per codepoint for the encoding used. For UTF8 it's 4
bytes. For a single-byte ANSI charset it's 1 byte.
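Just to illustrate what I mean, a lookup on the client side could look
something like this in C (the character set ids below are what I believe
Firebird uses, so treat them as an assumption; the authoritative values,
including bytes per character, are in the system table
RDB$CHARACTER_SETS):

/* Sketch only: map a character set id to its max bytes per codepoint.
   The ids below are assumptions on my part; the real values can be read
   from RDB$CHARACTER_SETS (RDB$BYTES_PER_CHARACTER). */
static int max_bytes_per_codepoint(int charset_id)
{
    switch (charset_id)
    {
        case 4:  return 4;  /* UTF8 (assumed id) */
        case 3:  return 3;  /* UNICODE_FSS (assumed id) */
        default: return 1;  /* single-byte charsets, e.g. WIN1252 */
    }
}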

So, the client can find out the buffer size and the max number of bytes
per codepoint. Divide the former by the latter to get the field size in
codepoints, i.e. the value of N for a (var)char(N) field.
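In code that division is trivial; something like this, using the lookup
sketched above and assuming the buffer size in bytes comes from the
XSQLVAR's sqllen field:

/* Sketch: declared field length N in codepoints, from the buffer size
   in bytes (e.g. XSQLVAR.sqllen) and the charset's max bytes/codepoint. */
static int declared_length_in_codepoints(int buffer_bytes, int charset_id)
{
    return buffer_bytes / max_bytes_per_codepoint(charset_id);
}

/* Example: char(2) in UTF8 gives buffer_bytes = 8, so 8 / 4 = 2. */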

Then, trim the data in the buffer to N codepoints, which may be anything
from 2 to 8 bytes for a UTF8 char(2) column. It's up to the client to
parse the UTF8 string to find out how many bytes are used by the first
two codepoints in the buffer. Firebird *could* pass zeros or
"uninitialized data" in the remaining part of the buffer, but they have
chosen to pad with spaces.
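A sketch of that parsing step could look like this (plain C, nothing
Firebird-specific; it just skips UTF8 continuation bytes, which always
match the bit pattern 10xxxxxx):

#include <stddef.h>

/* Sketch: number of bytes occupied by the first n_codepoints of a
   UTF8 buffer. Every byte that is not a continuation byte starts a
   new codepoint. */
static size_t utf8_bytes_for_codepoints(const unsigned char *buf,
                                        size_t buf_len,
                                        size_t n_codepoints)
{
    size_t i = 0, seen = 0;
    while (i < buf_len && seen < n_codepoints)
    {
        i++;                                    /* lead byte */
        while (i < buf_len && (buf[i] & 0xC0) == 0x80)
            i++;                                /* continuation bytes */
        seen++;
    }
    return i;
}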

Martijn, you know what bytes are significant by determining N from the
buffer size divided by "max bytes per character". N is not the number of
significant bytes, but the number of codepoints. You have to parse the
buffer to find out how many bytes you need to read to get N codepoints.
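Putting the sketches above together, the client could do roughly this
(treating sqldata as the space-padded buffer and sqllen as its size in
bytes is my assumption about the XSQLVAR layout for a char column):

/* Sketch: how many bytes of a char(N) UTF8 buffer are actual data,
   the rest being the space padding added by Firebird. */
static size_t significant_bytes(const unsigned char *sqldata,
                                size_t sqllen,
                                int charset_id)
{
    size_t n = sqllen / max_bytes_per_codepoint(charset_id);
    return utf8_bytes_for_codepoints(sqldata, sqllen, n);
}

/* For a char(2) UTF8 column holding "ab": sqllen = 8, n = 2, and the
   function returns 2, i.e. the first 2 bytes are data and the
   remaining 6 bytes are space padding. */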

Kjell
--
------------------------------
Kjell Rilbe
DataDIA AB
E-mail: kjell.rilbe@...
Phone: 08-761 06 55
Mobile: 0733-44 24 64

