Subject: Re: [firebird-support] Writing UTF16 to the database
Author: Lester Caine
Post date: 2005-02-19T21:55:19Z
Scott Morgan wrote:
> UNICODE_FSS _is_ UTF-8, it's just an old name for it. We've been over
> this several times.

Then perhaps someone will explain EXACTLY what is going on. UTF-8 CAN be
up to 6 bytes long! *I* thought that was a mistake, but if you read the
Unicode spec fully, you find you need 6 bytes for some of the larger
character numbers once you include the extra bits that identify which
byte of the character you are currently looking at. So WHAT do we
actually convert FROM to give the internal three-byte code?
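For reference, that "up to 6 bytes" figure comes from the original UTF-8
definition (RFC 2279), which covered 31-bit character numbers. The x
bits carry the character number itself; the fixed leading bits are the
overhead that marks where you are in the sequence:

    0xxxxxxx                                                -  7 bits
    110xxxxx 10xxxxxx                                       - 11 bits
    1110xxxx 10xxxxxx 10xxxxxx                              - 16 bits
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                     - 21 bits
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx            - 26 bits
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx   - 31 bits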
The bottom line is that Unicode only needs 3 bytes to code the number
for each character. UTF-8, UTF-16 and UTF-32 are just ways of carrying
that number over the wire, together with handling the different endian
problems and marking which byte of a multi-byte character you have
landed on when you drop at random into a sequence. What we STORE
internally just has to represent each character uniquely and be
consistent across all collations. It can NOT be UTF-8; it must be the
three-byte character number decoded from UTF-8. The client interface is
then responsible for converting that three-byte number into the correct
multi-byte sequence for whatever format is required when returning each
character.
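As a minimal sketch of what that decode step looks like (my own
illustration in C, with a hypothetical utf8_decode helper, not anything
taken from the Firebird source):

    #include <stdio.h>

    /* Decode one UTF-8 sequence into the plain Unicode character
       number, which always fits in three bytes. Covers the 1- to
       4-byte forms. */
    static unsigned long utf8_decode(const unsigned char *s, int *len)
    {
        if (s[0] < 0x80) {                   /* 0xxxxxxx */
            *len = 1;
            return s[0];
        }
        if ((s[0] & 0xE0) == 0xC0) {         /* 110xxxxx 10xxxxxx */
            *len = 2;
            return ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3F);
        }
        if ((s[0] & 0xF0) == 0xE0) {         /* 1110xxxx + 2 trail bytes */
            *len = 3;
            return ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
                 | (s[2] & 0x3F);
        }
        if ((s[0] & 0xF8) == 0xF0) {         /* 11110xxx + 3 trail bytes */
            *len = 4;
            return ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
                 | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3F);
        }
        *len = 1;
        return 0xFFFD;                       /* broken sequence marker */
    }

    int main(void)
    {
        /* U+20AC EURO SIGN: three bytes on the wire in UTF-8 ... */
        const unsigned char euro[] = { 0xE2, 0x82, 0xAC };
        int len;
        unsigned long cp = utf8_decode(euro, &len);
        /* ... but the character *number* is just 0x20AC, and that
           number, not the wire bytes, is what would be stored. */
        printf("U+%04lX decoded from %d UTF-8 bytes\n", cp, len);
        return 0;
    }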
I've had this same discussion in the past over date and time. Internally
THOSE are just numbers, and the interface provides the different textual
versions (the same split is sketched below). So WHAT are we actually
storing internally for each character of the Unicode table?
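To make that date and time parallel concrete, here is a minimal sketch
in C of the split between the stored number and its textual rendering
(again my own illustration, nothing to do with Firebird's date
handling):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t t = time(NULL);       /* the stored form is just a number */
        char buf[32];

        /* The interface renders that one number in whatever textual
           form the client asks for. */
        strftime(buf, sizeof buf, "%d %b %Y %H:%M", localtime(&t));
        printf("%ld -> %s\n", (long) t, buf);  /* cast assumes time_t fits a long */
        return 0;
    }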
--
Lester Caine
-----------------------------
L.S.Caine Electronic Services