Subject | Re: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
---|---
Author | David Johnson
Post date | 2005-05-03T02:08:29Z
> Date: Tue, 3 May 2005 00:14:03 +0200
> From: "Svend Meyland Nicolaisen" <news@...>
> Subject: UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
>
> I have lately wondered why UTF-8 generally seems to be preferred over UTF-16.
> I can understand the use of UTF-8 in applications that need to maintain
> backward compatibility with the US-ASCII character set and/or mainly use
> characters from the US-ASCII character set.
>
> The following "table" shows the bytes needed to represent the different code
> points in Unicode, using UTF-8 and UTF-16.
>
> 0x000000-0x00007F: 1 byte in UTF-8; 2 bytes in UTF-16
> 0x000080-0x0007FF: 2 bytes in UTF-8; 2 bytes in UTF-16
> 0x000800-0x00FFFF: 3 bytes in UTF-8; 2 bytes in UTF-16
> 0x010000-0x10FFFF: 4 bytes in UTF-8; 4 bytes in UTF-16
This table is incomplete, because UTF-8 and UTF-16 both reserve bits/bit
patterns to indicate which place in the character a byte resides in.
The byte sequence FF FF, for example, is an illegal pattern in UTF-8:
the high bits of the first byte announce a multi-byte sequence longer
than the two bytes present.
Furthermore, both UTF-16 and UTF-8 allow for page extensions, so the
potential byte count goes up to 6 bytes per character for UTF-8. I am
not sure what it is for UTF-16.
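
To make the reserved-pattern point concrete, here is a minimal sketch
(my own illustration, not Firebird code) of how a decoder classifies a
UTF-8 lead byte and rejects an illegal or truncated sequence; raw FF FF
fails at the first step, because 0xFF is never a legal lead byte:

```cpp
#include <cstddef>

// Expected sequence length for a UTF-8 lead byte, or 0 if the byte can
// never start a sequence (0xFE and 0xFF are never legal, nor is a
// continuation byte 10xxxxxx).  The original UTF-8 design also defined
// 5- and 6-byte forms (leads 111110xx and 1111110x), since disallowed
// by RFC 3629; they would be two more cases here.
static int utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // reserved/illegal lead byte
}

// Structural check of one sequence: legal lead byte, enough bytes
// present, and every trailing byte of the form 10xxxxxx.  (A full
// validator would also reject overlong forms and surrogates.)
static bool utf8SequenceComplete(const unsigned char* p, size_t avail)
{
    if (avail == 0)
        return false;
    int len = utf8SequenceLength(p[0]);
    if (len == 0 || (size_t) len > avail)
        return false;                     // illegal lead or truncated
    for (int i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)
            return false;
    return true;
}
```
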
>
> For many character sets that use many characters outside the US-ASCII
> character set, UTF-8 quickly begins to take up as much space as UTF-16. On
> top of that, decoding and encoding of UTF-16 is easier and faster than the
> decoding and encoding of UTF-8.
UTF-16 suffers from endian problems: do you use big-endian or little
endian? UTF-8 does not suffer from that problem, since it was meant as a
wire protocol.
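
For illustration, this is the guessing game a reader of raw UTF-16 has
to play (again just a sketch, and the names are mine); a byte-oriented
encoding never has to ask:

```cpp
#include <cstddef>

enum ByteOrder { BIG_ENDIAN_16, LITTLE_ENDIAN_16, UNKNOWN_ORDER };

// Look for a byte order mark (U+FEFF) at the front of a raw UTF-16
// stream.  Without one, the byte order has to come from out-of-band
// agreement, an ambiguity a byte-oriented encoding never has.
static ByteOrder detectUtf16ByteOrder(const unsigned char* p, size_t len)
{
    if (len >= 2) {
        if (p[0] == 0xFE && p[1] == 0xFF) return BIG_ENDIAN_16;
        if (p[0] == 0xFF && p[1] == 0xFE) return LITTLE_ENDIAN_16;
    }
    return UNKNOWN_ORDER; // no BOM: the bytes alone cannot tell you
}
```
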
> UTF-8 is a great encoding for character sets that need a full 32 bit
> character set, but the Unicode character set standard has been fixed at a
> maximum of 21 bits (code points up to 0x10FFFF).
> Also, if you need to "allocate" space for an X character wide text field in a
> database like Firebird, I would think that you need to allocate space for
> the worst case scenario, which is 4 times X for both UTF-8 and UTF-16. So the
> potential compression of UTF-8 doesn't help much here.
With a multibyte character set, whether UTF-8 or UTF-16, we need to get
away from the idea of allocating maximum storage in bytes for the number
of characters. A string allocation needs to be exactly as long as it
needs to be, and no more. Byte size and character size need to be
understood and treated as distinct measures that may not be directly
related: length(x) and sizeof(x) may be distinctly different, and should
be treated as independent values.
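
A minimal example of the distinction for UTF-8, with a hypothetical
helper (not a proposed API):

```cpp
#include <cstddef>
#include <cstring>

// Character count of a NUL-terminated UTF-8 string: count every byte
// that is not a continuation byte (10xxxxxx).
static size_t utf8Length(const char* s)
{
    size_t chars = 0;
    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            chars++;
    return chars;
}

// For "naïve": utf8Length() returns 5, strlen() returns 6 -- the two
// measures must be carried separately.
```
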
In my not-so-humble and often under-informed opinion, when allocating
space, you need to allocate the byte size of the string. You may need
to be prepared to split pages on different boundaries than you do today,
and re-org pages if they get fragmented.
After an insert or update, re-orging a single page that you have already
read into memory is relatively cheap, since you will have to rewrite the
entire page after modification anyway. Splitting a page is more
complex, but it is similar to what you do when you append a record and
run out of space in the last active page.
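
As a toy sketch of what I mean, assuming a simple slotted-page layout
(my assumption for illustration, not the actual Firebird ODS), an
in-memory re-org is just a compaction pass:

```cpp
#include <cstring>
#include <cstddef>

struct Slot { unsigned short offset, length; };

struct Page {
    enum { SIZE = 4096, MAX_SLOTS = 64 };
    unsigned char data[SIZE];
    Slot slots[MAX_SLOTS];
    int slotCount;
};

// Rewrite all live records contiguously from the top of the page,
// squeezing out the holes left by variable-length updates.  Cheap
// relative to the page write that has to happen anyway.
static void compactPage(Page& page)
{
    unsigned char scratch[Page::SIZE];
    size_t top = Page::SIZE;
    for (int i = 0; i < page.slotCount; i++) {
        Slot& s = page.slots[i];
        top -= s.length;
        std::memcpy(scratch + top, page.data + s.offset, s.length);
        s.offset = (unsigned short) top;
    }
    std::memcpy(page.data + top, scratch + top, Page::SIZE - top);
}
```
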
Another way to look at it is that in a multi-byte character set
paradigm, everything becomes a varchar. The special handling for CHAR
fields is to pad them with trailing spaces on read.
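
For instance, a sketch of the read-side padding (not engine code; it
assumes the stored character count is kept alongside the bytes):

```cpp
#include <string>
#include <cstddef>

// CHAR(n) stored with trailing spaces stripped; on read, pad back out
// to the declared width in characters.  Assumes the stored character
// count is tracked alongside the bytes -- length(x) vs. sizeof(x) again.
static std::string readCharField(const std::string& storedUtf8,
                                 size_t storedChars, size_t declaredChars)
{
    std::string result = storedUtf8;
    if (storedChars < declaredChars)
        result.append(declaredChars - storedChars, ' '); // a space is 1 byte in UTF-8
    return result;
}
```
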
Storage is cheap, but there is no reason to be a storage hog just
because you might need 4 bytes per character.
> UTF-16 also has the benefit of being compatible with UCS-2.

*nix systems are all UTF-8.
>
> /Svend
>
> > -----Original Message-----
> > From: Firebird-Architect@yahoogroups.com
> > [mailto:Firebird-Architect@yahoogroups.com] On Behalf Of Jim Starkey
> > Sent: 2 May 2005 16:51
> > To: Firebird-Architect@yahoogroups.com
> > Subject: [Firebird-Architect] Applications of Encoded Data Streams
> >
> > The self-describing, platform independent data stream
> > encoding schemes that we've been discussing have at least
> > three interesting potential applications within future
> > versions of Firebird.
> >
> > The first, as described earlier, is the message format used
> > by the lower layers of a new API and plumbing to move data
> > around. The model is that the engine and client sides
> > encode and transmit data in "original" form, with any
> > conversions necessary performed in the data consumer's
> > context. This means that a client can set data in any type
> > it wishes, the data will be encoded and transmitted in
> > that type, and if the engine needs the data in a different
> > type, the engine performs the conversion. An exception is
> > character encoding: all character data is encoded in utf-8
> > unicode. For reference, the client side of DSQL uses
> > information from "prepare describing" to perform data
> > conversions on the client side before sending the data, and
> > character data retains its original encoding. The goal of
> > encoded data streams is to eliminate intermediate data copy
> > and conversion steps to reduce the overhead of the plumbing.
> >
> > The second application would be to support "getRecord" and
> > "setRecord" methods in the higher layers of the new API. The
> > basic idea is that an application could do a single getRecord
> > call to get a complete row from a ResultSet that could later
> > be used to "setRecord" to a prepared insert statement with the
> > same field list. This mechanism is ideal for a future version
> > of gbak or a similar backup/restore utility. Another natural
> > use is in replication.
> >
> > The third application of encoded data streams is as a record
> > encoding in the on-disk structure. There is some appeal to
> > an increase in record density, but the main benefit would be
> > to eliminate physical string length restrictions within the
> > engine. In the current ODS, strings are stored within a
> > record in a fixed length field of some declared size.
> > This worked OK for ASCII, but multi-byte characters in
> > general, and utf-8 in particular, make this problematic.
> > Use of encoded data streams would divorce the physical storage
> > from the logical declaration, reducing the declared field
> > length to a constraint on the number of characters, not
> > bytes, of a string. It would also allow us to implement an
> > unconstrained length string type like Java's, deprecating the
> > concept of fixed length strings as an unfortunate legacy of
> > punched cards.
> > Adoption of encoded data streams within the ODS, however, is
> > predicated on switching the engine from multiple character
> > sets to purely utf-8.
> >
> > Personally, I think the ideas of arbitrary string handling,
> > simplified international character set support, and a quantum
> > leap in plumbing efficiency are pretty exciting. The fact that
> > it would also reduce code size and complexity is just icing on
> > the cake.
> >
> > --
> >
> > Jim Starkey
> > Netfrastructure, Inc.
> > 978 526-1376