Subject: UTF-8 over UTF-16 (WAS: Applications of Encoded Data Streams)
Author: Svend Meyland Nicolaisen
I have lately wondered why UTF-8 generally seems to be preferred over UTF-16.
I can understand the use of UTF-8 in applications that need to maintain
backward compatibility with the US-ASCII character set and/or mainly use
characters from the US-ASCII character set.

The following "table" shows the bytes needed to represent the different code
points in Unicode, using UTF-8 and UTF-16.

0x000000-0x00007F: 1 byte in UTF-8; 2 bytes in UTF-16
0x000080-0x0007FF: 2 bytes in UTF-8; 2 bytes in UTF-16
0x000800-0x00FFFF: 3 bytes in UTF-8; 2 bytes in UTF-16
0x010000-0x10FFFF: 4 bytes in UTF-8; 4 bytes in UTF-16
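The table can be checked directly with Python's standard codecs; a quick sketch (the "utf-16-le" codec is used to avoid the 2-byte BOM that the plain "utf-16" codec prepends):

```python
# One sample code point per row of the table above:
# code point -> (bytes in UTF-8, bytes in UTF-16)
samples = {
    0x000041: (1, 2),  # 'A'   (US-ASCII range)
    0x0000E9: (2, 2),  # 'é'   (0x000080-0x0007FF)
    0x0020AC: (3, 2),  # '€'   (0x000800-0x00FFFF)
    0x01F600: (4, 4),  # emoji (0x010000-0x10FFFF; surrogate pair in UTF-16)
}
for cp, (u8, u16) in samples.items():
    ch = chr(cp)
    assert len(ch.encode("utf-8")) == u8
    assert len(ch.encode("utf-16-le")) == u16
print("table verified")
```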

For many languages that use characters outside the US-ASCII range, UTF-8
quickly takes up as much space as UTF-16. On top of that, decoding and
encoding UTF-16 is simpler and faster than decoding and encoding UTF-8.

UTF-8 is a great encoding for character sets that need a full 31 bit code
space (its original definition allowed up to 6 bytes per character), but the
Unicode standard has been fixed at a maximum of a 21 bit character set
(0x000000-0x10FFFF).

Also, if you need to "allocate" space for an X character wide text field in a
database like Firebird, I would think that you need to allocate space for
the worst case scenario, which is 4 times X bytes for both UTF-8 and UTF-16.
So the potential compression of UTF-8 doesn't help much here.
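A small sketch of that worst case: both encodings need 4 bytes per character for supplementary-plane characters (UTF-16 via a surrogate pair), so an X character field needs 4*X bytes either way.

```python
# Worst-case storage for an X-character field under both encodings.
X = 10
worst = chr(0x10400) * X  # a supplementary-plane character, repeated X times
assert len(worst.encode("utf-8")) == 4 * X     # 4 bytes per char in UTF-8
assert len(worst.encode("utf-16-le")) == 4 * X # surrogate pair in UTF-16
```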

UTF-16 also has the benefit of being compatible with UCS-2 for code points
in the Basic Multilingual Plane.
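To illustrate that compatibility: for a BMP code point, the UTF-16 encoding is exactly the code point value as a 16-bit integer, i.e. its UCS-2 form.

```python
import struct

# For BMP code points, one UTF-16 code unit == the UCS-2 value.
ch = "\u20ac"  # '€', U+20AC, inside the BMP
assert ch.encode("utf-16-le") == struct.pack("<H", ord(ch))
```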


> -----Original Message-----
> From:
> [] On Behalf Of Jim Starkey
> Sent: 2. maj 2005 16:51
> To:
> Subject: [Firebird-Architect] Applications of Encoded Data Streams
> The self-describing, platform independent data stream
> encoding schemes that we've been discussing have at least
> three interesting potential applications within future
> versions of Firebird.
>
> The first, as described earlier, is the message format used
> by the lower layers of a new API and plumbing to move data
> around. The model is that the engine and client sides
> encode and transmit data in "original" form, with any
> necessary conversions performed in the data consumer's
> context. This means that a client can set data in any type
> it wishes, the data will be encoded and transmitted in
> that type, and if the engine needs the data in a different
> type, the engine performs the conversion. An exception is
> character encoding: all character data is encoded in UTF-8
> Unicode. For reference, the client side of DSQL uses
> information from "prepare describing" to perform data
> conversions on the client side before sending the data, and
> character data retains its original encoding. The goal of
> encoded data streams is to eliminate intermediate data copy
> and conversion steps to reduce the overhead of the plumbing.
>
> The second application would be to support "getRecord" and
> "setRecord" methods in the higher layers of the new API. The
> basic idea is that an application could do a single getRecord
> call to get a complete row from a ResultSet that could later
> be passed to "setRecord" on a prepared insert statement with
> the same field list. This mechanism is ideal for a future
> version of gbak or a similar backup/restore utility. Another
> natural use is in replication.
>
> The third application of encoded data streams is as a record
> encoding in the on-disk structure. There is some appeal to
> an increase in record density, but the main benefit would be
> to eliminate physical string length restrictions within the
> engine. In the current ODS, strings are stored within a
> record in a fixed length field of some declared size.
> This worked OK for ASCII, but multi-byte characters in
> general, and UTF-8 in particular, make this problematic.
> Use of encoded data streams would divorce the physical
> storage from the logical declaration, reducing the declared
> field length to a constraint on the number of characters,
> not bytes, of a string. It would also allow us to implement
> an unconstrained-length string type like Java's, deprecating
> the concept of fixed length strings as an unfortunate legacy
> of punched cards.
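The constraint described in the quoted paragraph above can be sketched in a few lines: the declared length bounds the character count, while the stored byte length varies with the content (the function name and sample string are hypothetical, purely for illustration):

```python
# Hypothetical sketch: a declared field length applied to characters,
# while the on-disk byte length is whatever the UTF-8 encoding needs.
def fits_declared_length(value: str, declared_chars: int) -> bool:
    return len(value) <= declared_chars  # count characters, not bytes

s = "na\u00efve\u20ac"                   # "naïve€": 6 characters
assert fits_declared_length(s, 6)        # passes a CHAR(6)-style constraint
assert len(s.encode("utf-8")) == 9       # ...yet occupies 9 bytes in UTF-8
```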
>
> Adoption of encoded data streams within the ODS, however, is
> predicated on switching the engine from multiple character
> sets to purely UTF-8.
>
> Personally, I think the ideas of arbitrary string handling,
> simplified international character set support, and a quantum
> leap in plumbing efficiency are pretty exciting. The fact
> that it would also reduce code size and complexity is just
> icing on the cake.
> --
> Jim Starkey
> Netfrastructure, Inc.
> 978 526-1376