Subject | Re: [firebird-support] Re: Firebird and Unicode queries |
---|---|
Author | Olivier Mascia |
Post date | 2005-02-11T16:59:54Z |
On 11 Feb 2005 at 14:57, Lester Caine wrote:
>> What's more, UTF-32 is NOT a "four byte truncation of UTF-8".
>> Absolutely NOT. Here, you're wrong Lester.
> I put my hand up , and can't see the bit I was quoting from today. My
> mistake was probably miss reading something last night. I was thinking 6
> bytes which in my method of working gives 6 lots of 7 bits + extra byte
> flag - 42/43 bits - I had not realised that UNICODE is only 32 bit ( I'm
> sure there was a larger potential map at some stage but I stand
> corrected )
> So now I need to work out why 6 bytes needs to come into the equation at
> all, 21bits should still map to 4bytes with 3 'rollover' flags ;)
(fixed width font required to read the following table)
Here is how UTF-8 works:
bytes | bits | representation
1     | 7    | 0vvvvvvv
2     | 11   | 110vvvvv 10vvvvvv
3     | 16   | 1110vvvv 10vvvvvv 10vvvvvv
4     | 21   | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
An ASCII character (7 bits) is stored as a single byte (line 1 of the
table).
A Unicode character which needs more than 7 bits and up to 11 bits is
stored according to line 2 of the table, and so on up to line 4, where
you see how 4 bytes store up to 21 bits.
There are interesting characteristics in the encoding. When a character
is stored as multiple bytes, all of the bytes, including the first one,
have the 8th bit set. All additional bytes have their 2 most significant
bits set to '10', while the first byte of a multi-byte sequence has its
2 most significant bits set to '11'. This lets you immediately tell
whether you're on the first byte of a multi-byte character or on one of
the trailing bytes of such a sequence. You also know the size of the
sequence by checking whether the first byte starts with '110', '1110' or
'11110'. This allows for fast forward skipping.
All in all, UTF-8 is a very interesting encoding. A binary 0 is never
part of the encoding of any character other than NUL itself. The only
byte with value 0 you could find is one you would have entered yourself,
so the zero-terminated C-string concept remains valid.
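
To make the table concrete, here is a small Python sketch of an encoder that follows its four rows, plus a helper that reads the sequence length from the first byte. The names utf8_encode and sequence_length are made up for this illustration, and the rejection of the surrogate range D800-DFFF is an addition not discussed above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the four rows of the table."""
    # Assumption: reject values outside 0..10FFFF and the surrogate range.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # up to 7 bits:  0vvvvvvv
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110vvvvv 10vvvvvv
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 1110vvvv 10vvvvvv 10vvvvvv
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a leading byte."""
    if first_byte < 0x80:            # 0vvvvvvv: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110vvvvv
        return 2
    if first_byte >> 4 == 0b1110:    # 1110vvvv
        return 3
    if first_byte >> 3 == 0b11110:   # 11110vvv
        return 4
    raise ValueError("trailing byte or invalid leading byte")


if __name__ == "__main__":
    euro = utf8_encode(0x20AC)       # EURO SIGN, needs 14 bits
    print(euro.hex())                # e282ac
    print(sequence_length(euro[0]))  # 3
```

The hand-written encoder can be checked against Python's own str.encode("utf-8"), which should give the same bytes for any valid character.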
Binary sorting of UTF-8 encoded strings gives exactly the same order as
binary sorting of those same strings represented in UTF-32 (4 bytes per
character): both end up ordered by code point value.
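
A quick way to convince yourself of the sorting property is to sort the same list by its UTF-8 bytes and by its UTF-32 bytes. This is only a sketch: the sample words are arbitrary, and the UTF-32 side uses the big-endian form so that a byte-wise comparison matches a comparison of 32-bit values. UTF-16 is included as a contrast, since characters beyond FFFF use surrogate pairs there and can sort differently.

```python
# Arbitrary sample: ASCII, Latin-1, a 3-byte character, a fullwidth
# letter (U+FF21) and a character outside the first plane (U+1F600).
words = ["zebra", "Z\u00fcrich", "\u00e9t\u00e9", "\U0001F600",
         "\uFF21", "abc", "\u20acuro"]

# Sort by the raw bytes of each encoding.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf32 = sorted(words, key=lambda s: s.encode("utf-32-be"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

if __name__ == "__main__":
    print(by_utf8 == by_utf32)   # True: both are code point order
    print(by_utf8 == by_utf16)   # False for this sample
```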
Writing a strlen() function so that it works correctly with UTF-8 is
very simple and efficient thanks to the sequence length embedded in the
first byte. Writing functions to skip to the previous or next character
is also very simple thanks to the design of UTF-8. You don't need any
context. You can start at any byte in a buffer containing UTF-8 and
very easily go backward to the beginning of a character (if you land in
the middle of a multi-byte sequence), or go forward to the next character.
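
As an illustration of these three operations, here is a minimal Python sketch working on a bytes buffer. The function names are invented for the example, and it assumes the buffer holds valid UTF-8 (no error handling for truncated or malformed sequences).

```python
def sequence_length(first_byte: int) -> int:
    """Length of the sequence announced by a leading byte (valid input assumed)."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    return 4


def utf8_strlen(buf: bytes) -> int:
    """Count characters by hopping from leading byte to leading byte."""
    count = 0
    i = 0
    while i < len(buf):
        i += sequence_length(buf[i])
        count += 1
    return count


def char_start(buf: bytes, i: int) -> int:
    """Back up from any index to the first byte of the character it falls in."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:   # trailing bytes look like 10vvvvvv
        i -= 1
    return i


def next_char(buf: bytes, i: int) -> int:
    """Index of the first byte of the character following the one at index i."""
    start = char_start(buf, i)
    return start + sequence_length(buf[start])


if __name__ == "__main__":
    buf = "na\u00efve \u20ac5 \U0001F600".encode("utf-8")
    print(len(buf), utf8_strlen(buf))   # 16 bytes, 10 characters
    print(char_start(buf, 8))           # 7: index 8 is inside the euro sign
    print(next_char(buf, 8))            # 10: the byte right after the euro sign
```

Neither char_start nor next_char looks at anything but the bytes around the given index, which is the "no context needed" property described above.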
That's why I think that UTF-8 **might** be considered, not without some
effort of course, but without too much grief, as an internal
representation and storage representation inside Firebird. But that is
another question.
>> The UNICODE_FSS seem to use 3 bytes, so 24 bits. So I assume that this
>> thing called 'UNICODE_FSS' is just like UTF-32 where the most
>> significant byte, which is always zero, is not stored. If that is the
>> case, then, YES, UNICODE_FSS can store the entire Unicode code-space
>> and there is a clear bi-directional full conversion possible between
>> any of these 4 representations : UTF-8, UTF-16, UTF-32, UNICODE_FSS.
It isn't, as of today.
> 'seem to use' - Having had another 'quick' look at Unicode4.0.0 Spec,
> can someone confirm that the fourth byte is zero, and not 'reserved for
> future use'?
Not even all of the 17 planes of 64K characters are used today.
For instance Unicode 4.1, which is due by March of this year, adds 1273
new characters, but these fit comfortably into the code-space ranging
from 00 to 10FFFF. There are thousands of available and reserved codes
in the existing code-space, for instance between 030000 and 0DFFFF.
Nobody will need a fourth byte any time soon. Though the Klingon
language is now very common in our solar system, we still have too few
contacts with other galaxies to really care today about extending the
21-bit code-space.
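
For anyone who wants to check the arithmetic behind these numbers, here is a short Python sketch. The constant names are made up for the illustration, and it only does arithmetic on the ranges quoted above. It says nothing about which codes are actually assigned.

```python
PLANES = 17
PLANE_SIZE = 0x10000                  # 64K code points per plane

code_space = PLANES * PLANE_SIZE      # 1114112 code points in total
highest = code_space - 1              # 0x10FFFF

bits_needed = highest.bit_length()    # 21, hence the 21-bit code-space
utf32_bytes = highest.to_bytes(4, "big")
top_byte = utf32_bytes[0]             # most significant UTF-32 byte: 0

# Size of the range 030000..0DFFFF mentioned above.
gap = 0xDFFFF - 0x30000 + 1           # 720896 code points
gap_planes = gap // PLANE_SIZE        # 11 whole planes

if __name__ == "__main__":
    print(hex(highest), bits_needed, top_byte)
    print(gap, gap_planes)
```

So every code point up to 10FFFF fits in 3 bytes, and the most significant byte of its UTF-32 form is always zero.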
--
Olivier Mascia