Subject Re: [firebird-support] Writing UTF16 to the database
Author Brad Pepers
Lester Caine wrote:
> Scott Morgan wrote:
>
>
>>UNICODE_FSS _is_ UTF-8, it's just an old name for it. We've been over
>>this several times.
>
>
> Then perhaps someone will explain EXACTLY what is going on. UTF-8 CAN be
> up to 6 bytes long! *I* thought that was a mistake, but if you read the
> UNICODE spec fully, then you need 6 bytes for some of the larger
> characters once you include the extra bits designed to identify which
> byte of the character you are currently looking at. WHAT do we actually
> convert FROM to give the internal three byte code ?

No, it can't be 6 bytes long. The original ISO 10646 code space was 31
bits, which would take up to 6 bytes in UTF-8, *but* the code space has
since been limited to 21 bits (U+0000 to U+10FFFF), which UTF-8 can
represent in at most 4 bytes. Read further.
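
To make the byte counts concrete, here is a small Python sketch (my own
illustration, not anything from the Firebird sources). The function name
utf8_length is made up for this example; the ranges follow the UTF-8
definition in RFC 3629, and the loop checks them against Python's own
encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 needs for one code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:       # up to 7 bits
        return 1
    if code_point < 0x800:      # up to 11 bits
        return 2
    if code_point < 0x10000:    # up to 16 bits
        return 3
    return 4                    # up to 21 bits

# Cross-check against the built-in encoder, including the largest
# code point that Unicode allows (U+10FFFF).
for cp in (0x41, 0xE9, 0x20AC, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

# The largest code point fits in 21 bits, so 4 bytes is the maximum.
assert (0x10FFFF).bit_length() == 21
```

So the longest sequence a conforming encoder will ever produce is 4
bytes.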

> The bottom line is that Unicode only needs 3 bytes to code the number
> for each character. UTF-8, UTF-16 and UTF-32 are just ways of carrying
> that code over the wire. Along with the different endian problems and
> ways of encoding which character of a sequence of characters you are on
> when you drop at random into a sequence. What we STORE internally just
> has to represent each character, uniquely, and be consistent across all
> collations. It can NOT be UTF-8, it must be the three byte character
> number decoded from UTF-8. The client interface is then responsible for
> converting that three byte code into the correct multiple byte sequence
> for the format required when returning each character.

And why can't it be UTF-8? What's your argument against UTF-8?

> I've had this same discussion in the past over date and time. Internally
> THOSE are just numbers, and the interface provides the different textual
> versions. So WHAT are we actually storing internally for each character
> of the Unicode table?

What you store internally doesn't really matter, since UTF-8/16/32 are
all ways of representing a Unicode character. The easiest choice,
though, is to match the encoding of whatever string class is used
inside Firebird. I would suggest UTF-8 myself, since it doesn't suffer
from embedded nulls or endian problems and is very compact for ASCII
characters.
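
A rough Python illustration of those three properties, using only the
standard codecs. The helper name encoding_profile is hypothetical and
exists only for this example.

```python
def encoding_profile(text: str) -> dict:
    """Compare UTF-8 with both byte orders of UTF-16 for one string."""
    utf8 = text.encode("utf-8")
    utf16_le = text.encode("utf-16-le")
    utf16_be = text.encode("utf-16-be")
    return {
        "utf8_bytes": len(utf8),
        "utf16_bytes": len(utf16_le),
        "utf8_has_nul": b"\x00" in utf8,
        "utf16_has_nul": b"\x00" in utf16_le,
        # UTF-16 yields different bytes depending on byte order;
        # UTF-8 has only one byte order, so there is nothing to compare.
        "utf16_depends_on_endianness": utf16_le != utf16_be,
    }

profile = encoding_profile("Firebird")
assert profile["utf8_bytes"] == 8       # one byte per ASCII character
assert profile["utf16_bytes"] == 16     # two bytes per ASCII character
assert profile["utf8_has_nul"] is False
assert profile["utf16_has_nul"] is True
assert profile["utf16_depends_on_endianness"] is True
```

The one caveat is U+0000 itself, which UTF-8 encodes as a single zero
byte, so "no embedded nulls" holds only for text that does not contain
NUL in the first place.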

Like I said before, you have the strings coming from the client, whose
character set you have to decide on, and the strings coming out of the
database for columns, and then you have to be able to collate and
compare them internally. Pick what you want internally, but whatever it
is, you will need to convert all other strings to it, and you will need
to be able to compare and collate it. A Unicode-aware string class
would likely help with this, but comparison and collation are not
trivial things with Unicode!
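
As a small sketch of why comparison is not trivial, here is a Python
example using the standard unicodedata module. The helper name
equal_after_nfc is made up for illustration, and NFC normalization is
just one possible policy, not a claim about what Firebird does or
should do.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Both render as the same character, yet they differ code point by
# code point and byte by byte.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

def equal_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

assert equal_after_nfc(composed, decomposed)

# Sorting by raw code point is not a linguistic collation either:
# uppercase sorts before lowercase, and accented letters sort last.
words = ["zebra", "Zebra", "apple", "\u00e9clair"]
assert sorted(words) == ["Zebra", "apple", "zebra", "\u00e9clair"]
```

A real collation needs language-specific rules on top of this, which is
the part that takes real work whichever internal encoding you pick.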

--
Brad Pepers
brad@...