Subject | RE: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams |
---|---|
Author | Svend Meyland Nicolaisen |
Post date | 2005-05-03T09:15:08Z |
> > The following "table" shows the bytes needed to represent the
> > different code points in Unicode, using UTF-8 and UTF-16.
> >
> > 0x000000-0x00007F: 1 byte in UTF-8; 2 bytes in UTF-16
> > 0x000080-0x0007FF: 2 bytes in UTF-8; 2 bytes in UTF-16
> > 0x000800-0x00FFFF: 3 bytes in UTF-8; 2 bytes in UTF-16
> > 0x010000-0x10FFFF: 4 bytes in UTF-8; 4 bytes in UTF-16
>
> This table is incomplete because UTF-8 and UTF-16 both reserve
> bits/bit patterns to indicate which place in the character the
> byte resides in. The character FFFF, for example, is an illegal
> pattern in UTF-8: the high bits of the pattern say that the first
> byte is one of three bytes, but the pattern is only two bytes long.

It is not incomplete. The left-hand side intervals are Unicode code points,
not the encoded UTF-8 or UTF-16 patterns.
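
To make the table concrete, here is a rough sketch of the mapping from a code
point to the byte counts above (my own throwaway code, not anything in FireBird):

```c
#include <stdint.h>

/* Bytes needed to encode one Unicode code point (0x000000 - 0x10FFFF) in UTF-8. */
static int utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x00007F) return 1;
    if (cp <= 0x0007FF) return 2;
    if (cp <= 0x00FFFF) return 3;
    return 4;                          /* 0x010000 - 0x10FFFF */
}

/* Bytes needed to encode the same code point in UTF-16. */
static int utf16_encoded_length(uint32_t cp)
{
    return (cp <= 0x00FFFF) ? 2 : 4;   /* one 16-bit unit, or a surrogate pair */
}
```
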
> Furthermore, both UTF-16 and UTF-8 allow for page extensions,
> so the potential byte count goes up to 6 bytes per character
> for UTF-8. I am not sure what it is for UTF-16.

I was talking about the encoding of Unicode only. To encode the highest
possible code point defined by Unicode (0x10FFFF) you only need 4 bytes in
both UTF-8 and UTF-16. UTF-16 doesn't go beyond the code points defined by
Unicode at all.
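
For example, 0x10FFFF becomes a surrogate pair in UTF-16, i.e. two 16-bit
units and therefore 4 bytes. A rough sketch (again just my own illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Encode a supplementary-plane code point (0x010000 - 0x10FFFF) as a
   UTF-16 surrogate pair: two 16-bit units, 4 bytes in total. */
static void utf16_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;      /* 20 bits remain after subtracting the offset */
    *high = 0xD800 | (v >> 10);     /* top 10 bits */
    *low  = 0xDC00 | (v & 0x3FF);   /* bottom 10 bits */
}

int main(void)
{
    uint16_t hi, lo;
    utf16_surrogate_pair(0x10FFFF, &hi, &lo);
    printf("U+10FFFF -> 0x%04X 0x%04X\n", hi, lo);   /* prints 0xDBFF 0xDFFF */
    return 0;
}
```
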
> > For many character sets that use many characters outside the US-ASCII
> > character set, UTF-8 quickly begins to take up as much space as UTF-16.
> > On top of that, decoding and encoding of UTF-16 is easier and faster
> > than the decoding and encoding of UTF-8.
>
> UTF-16 suffers from endian problems. Do you use big-endian
> or little-endian? UTF-8 does not suffer from that problem
> since it was meant as a wire protocol.

Yes, that is true. But any data type that takes up more than one byte
suffers from endian problems. FireBird seems to take care of these
problems, I think. Another possibility is to use byte order marks, but
that might not be the thing to do in this case.
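
If byte order marks were used, detecting the order of a UTF-16 stream would
look something like the sketch below (a hypothetical helper, not anything
FireBird actually does):

```c
#include <stddef.h>

enum utf16_order { UTF16_BE, UTF16_LE, UTF16_UNKNOWN };

/* Look for a UTF-16 byte order mark (U+FEFF) at the start of a buffer:
   0xFE 0xFF means big-endian, 0xFF 0xFE means little-endian. */
static enum utf16_order detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) return UTF16_BE;
        if (buf[0] == 0xFF && buf[1] == 0xFE) return UTF16_LE;
    }
    return UTF16_UNKNOWN;   /* no BOM: the byte order must be agreed on elsewhere */
}
```
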
> > UTF-8 is a great encoding for character sets that need a full 32 bit
> > character set, but the Unicode character set standard has been fixed
> > at a maximum of a 21 bit character set (0x010000-0x10FFFF).
> >
> > Also, if you need to "allocate" space for an X character wide text
> > field in a database like FireBird, I would think that you need to
> > allocate space for the worst case scenario, which is 4 times X for
> > both UTF-8 and UTF-16. So the potential compression of UTF-8
> > doesn't help much here.
>
> With a multibyte character set, whether UTF-8 or UTF-16, we
> need to get away from the idea of allocating maximum storage
> in bytes for the number of characters. A string allocation
> needs to be exactly as long as it needs to be, and no more.
> Byte size and character size need to be understood and
> treated as distinct measures that may not be directly
> related. i.e. length(x) and sizeof(x) may be distinctly
> different, and should be treated as independent values.
>
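
To illustrate the length/size distinction, here is a rough sketch of counting
characters rather than bytes in a UTF-8 string (my own helper, not FireBird code):

```c
#include <stddef.h>

/* Character count of a NUL-terminated UTF-8 string: bytes of the form
   10xxxxxx are continuation bytes and do not start a new character. */
static size_t utf8_char_count(const char *s)
{
    size_t chars = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* For a 5-character string containing one 2-byte character, strlen()
   reports 6 bytes while utf8_char_count() reports 5 characters. */
```
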
> In my not so humble and often underinformed opinion, when
> allocating space, you need to allocate the byte size of the
> string. You may need to be prepared to split pages on
> different boundaries than you do today, and re-org pages if
> they get fragmented.
>
> After an insert or update, re-orging a single page that you
> have already read into memory is relatively cheap, since you
> will have to rewrite the entire page after modification
> anyways. Splitting a page is more complex, but it is similar
> to what you do when you append a record and you run out of
> space in the last active page.
>
> Another way to look at it is that in a multi byte character
> set paradigm, everything becomes a varchar. Special handling
> for CHAR fields is to pad them with trailing spaces on read.
>
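
If I understand the CHAR-as-varchar idea correctly, padding on read could look
roughly like this (a hypothetical helper, not how FireBird actually implements it):

```c
#include <stddef.h>

/* Pad a stored CHAR(n) value with trailing spaces up to its declared
   character length on read.  'stored' holds 'stored_bytes' bytes of UTF-8
   making up 'stored_chars' characters; 'out' must be large enough for the
   stored bytes plus the padding.  Returns the number of bytes written. */
static size_t pad_char_on_read(char *out, const char *stored,
                               size_t stored_bytes, size_t stored_chars,
                               size_t declared_chars)
{
    size_t n = 0;
    while (n < stored_bytes) {
        out[n] = stored[n];
        n++;
    }
    /* Each missing character is one space, a single byte in UTF-8. */
    for (size_t c = stored_chars; c < declared_chars; c++)
        out[n++] = ' ';
    return n;
}
```
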
> Storage is cheap, but there is no reason to be a storage hog
> because you might need 4 bytes per character.

That sounds reasonable. I suppose that FireBird uses fixed-size records
today. Are you saying that future releases of FireBird will use varying
record sizes, or how will that be implemented?
/Svend