| Subject | Re: [firebird-support] Writing UTF16 to the database |
|---|---|
| Author | Brad Pepers |
| Post date | 2005-02-20T05:56:17Z |
Lester Caine wrote:
> Scott Morgan wrote:
>
>>UNICODE_FSS _is_ UTF-8, it's just an old name for it. We've been over
>>this several times.
>
> Then perhaps someone will explain EXACTLY what is going on. UTF-8 CAN be
> up to 6 bytes long! *I* thought that was a mistake, but if you read the
> UNICODE spec fully, then you need 6 bytes for some of the larger
> characters once you include the extra bits designed to identify which
> byte of the character you are currently looking at. WHAT do we actually
> convert FROM to give the internal three byte code?

No, it can't be 6 bytes long. The original Unicode space was 32 bits,
which would take 6 bytes, *but* they agreed to limit things to 21 bits,
which can be represented by at most 4 bytes in UTF-8. Read further.
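To make the 21-bit point concrete, here is a quick sketch (Python 3.8+ used purely for illustration, not Firebird code) showing that code points up to U+10FFFF never need more than 4 UTF-8 bytes:

```python
# Illustrative only: UTF-8 length for code points across the 21-bit Unicode range.
for cp in (0x41, 0x7FF, 0xFFFF, 0x10FFFF):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+000041 -> 1 byte(s): 41
# U+0007FF -> 2 byte(s): df bf
# U+00FFFF -> 3 byte(s): ef bf bf
# U+10FFFF -> 4 byte(s): f4 8f bf bf
```

The old 5- and 6-byte UTF-8 forms only covered the larger original code space, which is exactly the part that was dropped when Unicode was capped at U+10FFFF.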
> The bottom line is that Unicode only needs 3 bytes to code the number
> for each character. UTF-8, UTF-16 and UTF-32 are just ways of carrying
> that code over the wire, along with the different endian problems and
> ways of encoding which character of a sequence of characters you are on
> when you drop at random into a sequence. What we STORE internally just
> has to represent each character, uniquely, and be consistent across all
> collations. It can NOT be UTF-8, it must be the three byte character
> number decoded from UTF-8. The client interface is then responsible for
> converting that three byte code into the correct multiple byte sequence
> for the format required when returning each character.

And why can't it be UTF-8? What's your argument against UTF-8?
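Worth noting here (a small illustrative sketch, not a claim about Firebird's internals): the "three byte character number" is just the code point, and UTF-8 round-trips it losslessly, so either form carries exactly the same information.

```python
# The code point ("three byte number") and its UTF-8 bytes are interchangeable.
ch = "€"                               # U+20AC, the Euro sign
code_point = ord(ch)                   # 8364, fits comfortably in 21 bits
utf8_bytes = ch.encode("utf-8")        # b'\xe2\x82\xac'
print(hex(code_point), utf8_bytes.hex(" "))            # 0x20ac e2 82 ac
assert utf8_bytes.decode("utf-8") == chr(code_point)   # lossless round trip
```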
> I've had this same discussion in the past over date and time. Internally
> THOSE are just numbers, and the interface provides the different textual
> versions. So WHAT are we actually storing internally for each character
> of the Unicode table?

What you store internally doesn't really matter, since UTF-8/16/32 are
all ways of representing a Unicode character. The easiest approach, though,
is to match them with a string class to use in Firebird. I would suggest
UTF-8 myself, since it doesn't suffer from internal nulls or endian
problems and is very compact for ASCII characters.
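A short sketch of why UTF-8 is attractive as the internal form (again Python purely for illustration): no embedded zero bytes for non-NUL characters, no byte-order ambiguity, and ASCII stays one byte per character, while UTF-16 has all three problems.

```python
text = "Grüße"  # mixed ASCII and non-ASCII

utf8     = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

print(utf8.hex(" "))      # 47 72 c3 bc c3 9f 65           -- no zero bytes, order-independent
print(utf16_le.hex(" "))  # 47 00 72 00 fc 00 df 00 65 00  -- embedded 0x00 bytes
print(utf16_be.hex(" "))  # 00 47 00 72 00 fc 00 df 00 65  -- same text, opposite byte order
```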
Like I said before, you have the strings coming from the client, whose
character set you have to decide on, and the strings coming out of the
database for columns, and then you have to be able to collate and
compare them internally. Pick what you want internally, but whatever it
is, you will need to convert all other strings to it and to be able to
compare and collate it. A Unicode-aware string class would likely help
with this, but comparison and collation are not trivial things with
Unicode!
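To give a flavour of why comparison is not trivial, here is a minimal sketch assuming UTF-16 coming from the client and UTF-8 chosen internally; normalization is shown via Python's unicodedata, and a real server would want a proper collation library such as ICU on top of this.

```python
import unicodedata

# Convert a client string from UTF-16 to the assumed internal encoding (UTF-8).
client_bytes = "café".encode("utf-16-le")
internal = client_bytes.decode("utf-16-le").encode("utf-8")

# Two spellings of the same visible word: precomposed é vs. e + combining accent.
a = "café"         # ...ends in U+00E9
b = "cafe\u0301"   # ...ends in U+0065 U+0301
print(a == b)                                # False -- naive code point comparison
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))       # True  -- equal after normalization
```

And that is before you even get to language-specific collation orders.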
--
Brad Pepers
brad@...