Subject: Re: [firebird-support] Writing UTF16 to the database
Author: Lester Caine
Brad Pepers wrote:

>>>UNICODE_FSS _is_ UTF-8, it's just an old name for it. We've been over
>>>this several times.
>>
>>Then perhaps someone will explain EXACTLY what is going on. UTF-8 CAN be
>>up to 6 bytes long! *I* thought that was a mistake, but if you read the
>>UNICODE spec fully, then you need 6 bytes for some of the larger
>>characters once you include the extra bits designed to identify which
>>byte of the character you are currently looking at. WHAT do we actually
>>convert FROM to give the internal three-byte code?
>
> No, it can't be 6 bytes long. The original Unicode space was 32 bits,
> which would take 6 bytes, *but* they agreed to limit things to 21 bits,
> which can be represented by at most 4 bytes in UTF-8. Read further.

A succinct answer that clears everything up ;) (almost)
Thank you.
This is part of the problem: looking round for examples pulls up
out-of-date information. I can see now why *ALL* that I was looking at is
right - depending on which time frame one lives in :)
*AND* why it is important to have up-to-date references for OUR use of
these things.
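
Just to convince myself, a quick sketch in Python (my own check, nothing
pulled from the engine sources) showing the byte counts:

    # UTF-8 length per code point; U+10FFFF is the agreed ceiling, so
    # 4 bytes is the maximum once the old 5/6-byte forms are dropped.
    for cp in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF):
        print(hex(cp), '->', len(chr(cp).encode('utf-8')), 'bytes')
    # 0x7f -> 1, 0x7ff -> 2, 0xffff -> 3, 0x10ffff -> 4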

>>The bottom line is that Unicode only needs 3 bytes to code the number
>>for each character. UTF-8, UTF-16 and UTF-32 are just ways of carrying
>>that code over the wire, along with the different endian problems and
>>ways of working out which character of a sequence you are on when you
>>drop at random into it. What we STORE internally just has to represent
>>each character, uniquely, and be consistent across all collations. It
>>can NOT be UTF-8; it must be the three-byte character number decoded
>>from UTF-8. The client interface is then responsible for converting
>>that three-byte code into the correct multi-byte sequence for the
>>format required when returning each character.
>
> And why can't it be UTF-8? What's your argument against UTF-8?
I'm talking 'internally' - UTF-8 is at most four bytes (once you trim
the historic 5- and 6-byte forms) but it is being compressed to three
internally. So what is that compression - what is the internal format?
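
My best guess at what that 'compression' amounts to - strip the UTF-8
framing bits and you are left with the bare character number, which
always fits in 21 bits, i.e. three bytes. A sketch in Python (the
function name is mine, purely for illustration):

    def utf8_to_codepoint(raw):
        # let Python undo the UTF-8 framing; what remains is the number
        cp = ord(raw.decode('utf-8'))
        assert cp <= 0x10FFFF        # 21 bits, so it fits in 3 bytes
        return cp

    print(hex(utf8_to_codepoint(b'\xe2\x82\xac')))  # Euro sign -> 0x20ac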

<SNIP>
Once a single internal element per character exists, then all of the
actions such as SUBSTRING will work transparently. Collations can be
managed and comparisons made. I suspect that, given current technology,
4 bytes may actually be more efficient for doing that - but that is a
discussion for the Architects list.
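
A trivial illustration of why one element per character matters
(Python here, but the same issue applies to SUBSTRING in SQL):

    s = 'a\N{EURO SIGN}b'          # 3 characters, 5 UTF-8 bytes
    print(s[1])                    # character index: the Euro sign
    print(s.encode('utf-8')[1:2])  # byte index: b'\xe2', half a character
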
I am still left with what goes IN and OUT of UNICODE_FSS.
From the Bible (Helen's book):
> UNICODE_FSS
> Developers need to know that it is effectively a UTF8 implementation.
> Users need to know that it can be used to store UCS16 but not UCS32
> characters (those would take up to 6 bytes per character)......

Replies to this question have claimed that this statement is correct,
but presumably now only in a historic context? So I repeat the question:
what does UNICODE_FSS provide? If I put in a four-byte UTF-8 character,
will I get that back out?
I am only trying to be pedantic here because *I* am looking at the byte
stream coming out and not seeing what I put in, so *I* am doing
something wrong :( It's all very well saying 'use the correct format',
but 'fault finding' older examples is giving wrong answers - as far as I
can tell.
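
For reference, this is roughly how I am inspecting the byte stream - a
sketch only, with the actual fetch from the UNICODE_FSS column elided
since that depends on the client library in use:

    sent = '\N{MUSICAL SYMBOL G CLEF}'.encode('utf-8')  # U+1D11E
    print('in :', sent.hex())                           # f09d849e
    # received = <bytes read back from the UNICODE_FSS column>
    # print('out:', received.hex())                     # should match 'in'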

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services