Subject | Re: [firebird-support] Re: Firebird and Unicode queries |
---|---|
Author | Olivier Mascia |
Post date | 2005-02-10T15:40:40Z |
On 10 Feb 2005, at 06:46, Lester Caine wrote:
> Which is the crux of the problem when trying to manage the data within
> a database field. In the good old days you could look at the binary
> data and character 'x' would be at position 'x' on ALL records.
> UNICODE_FSS maintains that link at the expense of 24 bits per character
> rather than 8, but then string matching is consistent, and SUBSTRING is
> a simple count of characters not needing 'context'.
Handling UTF-8 strings does not need that much context, thanks to the
way code points are encoded.
"UTF-8 is self-segregating: you can always distinguish a lead byte
11vvvvvv from a fill byte 10vvvvvv and you will never be mistaken about
the beginning or the length of a multibyte character. You can start
parsing backwards at the end or in the middle of a multibyte string and
will soon find a synchronization point. String searches (fgrep) for a
multibyte character beginning with a lead byte will never match on the
fill byte in the middle of an unwanted multibyte character."
Sure, you cannot simply assume that the seventh byte of a string is
the seventh character. But you can still scan and compare strings
quickly, because of the way the encoding is done at the bit level:
bytes | bits | representation
1 | 7 | 0vvvvvvv
2 | 11 | 110vvvvv 10vvvvvv
3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv
4 | 21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
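
To make the table concrete, here is a rough sketch in Python (chosen only for illustration; this is not how Firebird does it). The names `sequence_length` and `char_lengths` are made up for this example, and the input is assumed to be well-formed UTF-8 with no validation of the fill bytes.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes a character occupies, judging only by its first byte."""
    if (lead & 0x80) == 0x00:   # 0vvvvvvv
        return 1
    if (lead & 0xE0) == 0xC0:   # 110vvvvv
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110vvvv
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110vvv
        return 4
    # 10vvvvvv is a fill byte; 11111vvv does not appear in the table above.
    raise ValueError(f"0x{lead:02X} is not a lead byte")


def char_lengths(data: bytes) -> list:
    """Walk a buffer character by character, hopping from lead byte to lead byte.

    Assumes well-formed UTF-8: the fill bytes themselves are not checked.
    """
    lengths = []
    pos = 0
    while pos < len(data):
        n = sequence_length(data[pos])
        lengths.append(n)
        pos += n
    return lengths
```

Counting the characters in a buffer is then just `len(char_lengths(data))`, at the cost of one test per character rather than a full decode.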
Reading the lead byte of a character, you know immediately how many
bytes are used to encode that character. Jumping to an arbitrary byte in
a buffer, you can immediately recognize it as a lead byte or a fill
byte. If it is a fill byte, you can easily step back to the previous
lead byte or forward to the next one.
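
The resynchronization described above could look something like the following Python sketch. The helper names (`is_fill_byte`, `previous_lead`, `next_lead`) are hypothetical, and the buffer is assumed to hold well-formed UTF-8, so each loop moves at most three bytes.

```python
def is_fill_byte(byte: int) -> bool:
    """True for a 10vvvvvv byte, i.e. one that continues a multibyte character."""
    return (byte & 0xC0) == 0x80


def previous_lead(buf: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character that contains it."""
    while pos > 0 and is_fill_byte(buf[pos]):
        pos -= 1
    return pos


def next_lead(buf: bytes, pos: int) -> int:
    """Step forward from pos to the next character start (len(buf) if there is none)."""
    while pos < len(buf) and is_fill_byte(buf[pos]):
        pos += 1
    return pos
```

For example, in the encoding of "a€b" (bytes 61 E2 82 AC 62), landing on offset 2 or 3 puts you in the middle of the euro sign; `previous_lead` returns 1 and `next_lead` returns 4 in both cases.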
--
Olivier Mascia