Subject | Re: [firebird-support] Re: Firebird and Unicode queries |
---|---|
Author | Scott Morgan |
Post date | 2005-02-11T16:29:52Z |
Lester Caine wrote:
>Olivier Mascia wrote:
>
>>The UNICODE_FSS seem to use 3 bytes, so 24 bits. So I assume that this
>>thing called 'UNICODE_FFS' is just like UTF-32 where the most
>>significant byte, which is always zero, is not stored. If that is the
>>case, then, YES, UNICODE_FSS can store the entire Unicode code-space
>>and there is a clear bi-directional full conversion possible between
>>any of these 4 representations : UTF-8, UTF-16, UTF-32, UNICODE_FSS.
>>
>
>'seem to use' - Having had another 'quick' look at Unicode4.0.0 Spec,
>can someone confirm that the fourth byte is zero, and not 'reserved for
>future use'?
>
I think what you're missing is the fact that UTF-8 isn't very space
efficient when it comes to handling the higher end of the Unicode space.
Each byte in a UTF-8 encoded character not only stores info on what
character it represents but also info on whether it's a start byte or an
extra byte, so parsing UTF-8 is easier.
Here the 'x's represent the Unicode data, the numbers the positioning info:
0xxxxxxx
110xxxxx,10xxxxxx
1110xxxx,10xxxxxx,10xxxxxx
And so on.
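To make the bit patterns above concrete, here is a rough Python sketch that packs a code point into them by hand and checks the result against Python's built-in encoder. The function name encode_utf8 is made up for illustration, and the 4-byte pattern (11110xxx) is the "and so on" case, written out as an assumption about how the scheme continues.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the bit patterns."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample character for each length: 1, 2, 3 and 4 bytes.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    hand = encode_utf8(ord(ch))
    assert hand == ch.encode("utf-8")   # agrees with the built-in encoder
    print(f"U+{ord(ch):04X} takes {len(hand)} byte(s)")
```

The last sample shows the space cost at the higher end: a code point above U+FFFF needs four bytes.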
Notice that if you check a random byte you can easily tell if it's a
start byte or not, and that the start byte tells you how many extra
bytes there are. Also notice that you can run this encoding up to 6
bytes, but the current Unicode space only requires up to 4 bytes.
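The start-byte check can be sketched in a few lines of Python. The helper names (is_continuation, extra_bytes, char_at) are made up for illustration, and the sketch assumes the input is already valid UTF-8 limited to the 4-byte forms.

```python
def is_continuation(byte: int) -> bool:
    """True for 10xxxxxx, i.e. an 'extra' byte inside a character."""
    return (byte & 0xC0) == 0x80


def extra_bytes(start: int) -> int:
    """Number of extra bytes announced by a start byte."""
    if (start & 0x80) == 0x00:   # 0xxxxxxx
        return 0
    if (start & 0xE0) == 0xC0:   # 110xxxxx
        return 1
    if (start & 0xF0) == 0xE0:   # 1110xxxx
        return 2
    if (start & 0xF8) == 0xF0:   # 11110xxx
        return 3
    raise ValueError("not a start byte")


def char_at(data: bytes, pos: int) -> str:
    """Decode the character covering the byte at pos (valid UTF-8 assumed)."""
    # Step back over extra bytes until the start byte is reached.
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    length = 1 + extra_bytes(data[pos])
    return data[pos:pos + length].decode("utf-8")


data = "a\u20acb".encode("utf-8")   # bytes: 61 e2 82 ac 62
# Offsets 1, 2 and 3 all fall inside the same 3-byte character.
assert char_at(data, 1) == char_at(data, 2) == char_at(data, 3) == "\u20ac"
```

Landing on any offset inside a character is enough to find its start again, which a fixed 3-byte scheme would not give you.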
http://en.wikipedia.org/wiki/UTF-8
Scott