firebird-architect - RE: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams

Subject	RE: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
Author	Svend Meyland Nicolaisen
Post date	2005-05-03T09:15:08Z

> >
> > The following "table" shows the bytes needed to represent the
> > different code points in Unicode, using UTF-8 and UTF-16.
> >
> > 0x000000-0x00007F: 1 byte in UTF-8; 2 bytes in UTF-16
> > 0x000080-0x0007FF: 2 bytes in UTF-8; 2 bytes in UTF-16
> > 0x000800-0x00FFFF: 3 bytes in UTF-8; 2 bytes in UTF-16
> > 0x010000-0x10FFFF: 4 bytes in UTF-8; 4 bytes in UTF-16
>
> This table is incomplete because UTF8 and UTF-16 both reserve
> bits/bit patterns to indicate which place in the character
> the byte is residing in. The character FFFF, for example, is
> an illegal pattern in UTF-8.
> The high bits of the pattern are telling me that the first
> byte is one of three bytes, but the pattern is only two bytes long.

It is not incomplete. The intervals on the left-hand side are Unicode code
points, not the encoded UTF-8 or UTF-16 bit patterns. The code point 0xFFFF,
for example, is encoded in UTF-8 as the three bytes EF BF BF.
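
To make the table concrete, here is a small Python sketch of my own (the
helper name encoded_sizes and the sample code points are just illustrative
choices). It takes the last code point of each interval and counts the bytes
each encoding produces:

```python
# Count encoded bytes for the last code point of each interval in the table.
# encoded_sizes is a hypothetical helper written for this illustration.

def encoded_sizes(code_point):
    """Return (utf8_bytes, utf16_bytes) for a single Unicode code point."""
    ch = chr(code_point)
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = [0x00007F, 0x0007FF, 0x00FFFF, 0x10FFFF]
for cp in samples:
    utf8_len, utf16_len = encoded_sizes(cp)
    print(f"0x{cp:06X}: {utf8_len} byte(s) in UTF-8; {utf16_len} bytes in UTF-16")
```

The inputs are code points, and the byte counts are what comes out of the
encoder, which is the distinction the table relies on.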

> Furthermore, both UTF-16 and UTF-8 allow for page extensions
> so the potential byte count goes up to 6 bytes per character
> for UTF-8. I am not sure what it is for UTF-16.

I was talking about the encoding of Unicode only. To encode the highest
possible code point defined by Unicode (0x10FFFF) you only need 4 bytes in
both UTF-8 and UTF-16. UTF-16 doesn't go beyond the code points defined by
Unicode at all.
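
The reason UTF-16 stops at 0x10FFFF can be seen from how surrogate pairs are
built. The following is a rough Python sketch (to_surrogates is a name I made
up for this example) that splits a code point above 0xFFFF into its two
16 bit units and checks the result against Python's own encoder:

```python
def to_surrogates(code_point):
    """Split a code point in 0x10000-0x10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # leaves a 20 bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x10FFFF)
print(hex(high), hex(low))

# Cross-check against the built-in big-endian UTF-16 encoder.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert chr(0x10FFFF).encode("utf-16-be") == expected
```

Each surrogate carries 10 bits, so the pair can address 0x100000 code points
on top of the first 0x10000, and the largest pair corresponds to 0x10FFFF.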

> > For many character sets that use many characters outside the US-ASCII
> > character set, UTF-8 quickly begins to take up as much space as UTF-16.
> > On top of that, decoding and encoding of UTF-16 is easier and faster
> > than the decoding and encoding of UTF-8.
>
> UTF-16 suffers from endian problems. Do you use big-endian
> or little endian? UTF-8 does not suffer from that problem
> since it was meant as a wire protocol.

Yes, that is true. But any data type that takes up more than one byte
suffers from endian problems. FireBird seems to take care of these
problems already, I think. Another possibility is to use byte order marks,
but that might not be the right thing to do in this case.
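
As an illustration of the byte order issue and of what a byte order mark
buys you, here is a short Python sketch. It only uses the standard codecs and
says nothing about how FireBird handles this internally:

```python
import codecs

text = "A"  # code point 0x41

big = text.encode("utf-16-be")      # high byte first
little = text.encode("utf-16-le")   # low byte first
with_bom = text.encode("utf-16")    # byte order mark, then data in native order

print(big, little, with_bom)

# The same two bytes read in the wrong order give a different character.
assert big.decode("utf-16-le") != text

# A decoder that sees the byte order mark picks the right order on its own.
assert (codecs.BOM_UTF16_BE + big).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + little).decode("utf-16") == text

# UTF-8 is a sequence of single bytes, so there is no order to choose.
assert text.encode("utf-8") == b"A"
```

The cost of the byte order mark is two extra bytes per string, which is one
reason it may not fit well for values stored in a record.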

> > UTF-8 is a great encoding for character sets that need a full 32 bit
> > character set, but the Unicode character set standard has been fixed
> > at a maximum of a 21 bit character set (0x000000-0x10FFFF).
>
> > Also if you need to "allocate" space for an X character wide text
> > field in a database like FireBird, I would think that you need to
> > allocate space for the worst case scenario, which is 4 times X bytes
> > for both UTF-8 and UTF-16. So the potential compression of UTF-8
> > doesn't help much here.
>
> With a multibyte character set, whether UTF-8 or UTF-16, we
> need to get away from the idea of allocating maximum storage
> in bytes for the number of characters. A string allocation
> needs to be exactly as long as it needs to be, and no more.
> Byte size and character size need to be understood and
> treated as distinct measures that may not be directly
> related. i.e. length(x) and sizeof(x) may be distinctly
> different, and should be treated as independent values.
>
> In my not so humble and often underinformed opinion, when
> allocating space, you need to allocate the byte size of the
> string. You may need to be prepared to split pages on
> different boundaries than you do today, and re-org pages if
> they get fragmented.
>
> After an insert or update, re-orging a single page that you
> have already read into memory is relatively cheap, since you
> will have to rewrite the entire page after modification
> anyways. Splitting a page is more complex, but it is similar
> to what you do when you append a record and you run out of
> space in the last active page.
>
> Another way to look at it is that in a multi byte character
> set paradigm, everything becomes a varchar. Special handling
> for CHAR fields is to pad them with trailing spaces on read.
>
> Storage is cheap, but there is no reason to be a storage hog
> because you might need 4 bytes per character.
>
>

That sounds reasonable. I suppose that FireBird uses fixed-size records
today. Are you saying that future releases of FireBird will use
variable-size records? If so, how will that be implemented?
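
For reference, the difference between character length and byte size, and
the worst case of 4 bytes per character, can be shown with a small Python
sketch. The function names and sample strings are my own and only stand in
for length(x), sizeof(x) and a fixed worst-case allocation:

```python
def char_length(s):
    """Number of Unicode code points, comparable to length(x)."""
    return len(s)

def byte_size(s, encoding):
    """Number of bytes actually needed to store s, comparable to sizeof(x)."""
    return len(s.encode(encoding))

def worst_case_bytes(max_chars):
    """Worst-case allocation in bytes for max_chars code points (UTF-8 or UTF-16)."""
    return 4 * max_chars

for s in ["Firebird", "Blåbærgrød", "日本語"]:
    n = char_length(s)
    print(
        f"{s}: {n} characters, "
        f"{byte_size(s, 'utf-8')} bytes in UTF-8, "
        f"{byte_size(s, 'utf-16-le')} bytes in UTF-16, "
        f"{worst_case_bytes(n)} bytes if allocated for the worst case"
    )
```

In each of these samples the actual byte size is well below the worst-case
allocation, which is the saving that storing by byte size would give.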

/Svend