firebird-support - Re: [firebird-support] Writing UTF16 to the database

Subject	Re: [firebird-support] Writing UTF16 to the database
Author	Brad Pepers
Post date	2005-02-16T23:21:15Z

Ann W. Harrison wrote:

> Scott Morgan wrote:
>
>
>>UNICODE_FSS is basically UTF-8 (UTF-FSS was an old name for what became
>>UTF-8 [0], File System Safe or something like that, basically because it
>>doesn't have the endian issue that UTF-16/32 have)
>
>
> The difference is actually that UNICODE-FSS is guaranteed not to have
> null bytes within a string - making it file system safe. The internal
> null bytes in UTF-8 are causing the truncation.

UTF-8 doesn't have internal null bytes. The Unicode characters 0-127
each map directly to one UTF-8 byte, and whenever extra bytes are needed
for Unicode characters larger than 127, the high bit is set on every one
of them. In fact you can tell by looking at any byte of a UTF-8 string
whether it's a single-byte code, the first byte of a multi-byte code, or
a continuation byte in the middle of one.
It works like this:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So single-byte codes always have 0 in the high bit and every byte of a
multi-byte code has 1 there, so there are no embedded nulls. The first
byte of a multi-byte code always has 11 in the top bits and the others
have 10. Once you've found the byte with 11 in the top bits, the next
bits tell you how long the character is (110 vs. 1110 vs. 11110). So
given an arbitrary index into a UTF-8 string you can look at the byte and
know if you are on a proper UTF-8 character boundary, and from the first
byte of every UTF-8 character you know how far to jump ahead to get to
the next character.
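
As a rough sketch, the bit tests described above could be written in
Python along these lines. The function names are hypothetical, chosen
only for illustration, and the code assumes well-formed UTF-8 (it does
not check that the right number of continuation bytes actually follows
a lead byte).

```python
def byte_kind(b: int) -> str:
    """Classify one UTF-8 byte by its top bits."""
    if b & 0x80 == 0x00:
        return "single"        # 0xxxxxxx
    if b & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx
    return "lead"              # 11xxxxxx


def sequence_length(first: int) -> int:
    """Number of bytes in the character that starts with this byte."""
    if first & 0x80 == 0x00:
        return 1               # 0xxxxxxx
    if first & 0xE0 == 0xC0:
        return 2               # 110xxxxx
    if first & 0xF0 == 0xE0:
        return 3               # 1110xxxx
    if first & 0xF8 == 0xF0:
        return 4               # 11110xxx
    raise ValueError("not the first byte of a UTF-8 character")


def char_start(data: bytes, i: int) -> int:
    """Back up from an arbitrary index to the start of its character."""
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i


def split_chars(data: bytes) -> list:
    """Walk the string, jumping from one character start to the next."""
    chars, i = [], 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


# 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
data = "a\u00e9\u20ac".encode("utf-8")
pieces = split_chars(data)
```

Here index 4 lands in the middle of the three-byte character, and
char_start(data, 4) backs up to index 3 where that character begins.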

The benefits of UTF-8 over UTF-16 or UTF-32 are that it doesn't contain
null bytes (except where the text itself contains the character U-0000,
which encodes as a single null byte) and it doesn't cause any endian
problems since everything is byte oriented. It is also very compact for
encoding text which mostly consists of ASCII characters. The downside
is that you have to inspect the data itself to know how many characters
a string contains, since it's a variable-length encoding.
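
These trade-offs can be checked with a short Python sketch. The sample
string and the count_chars helper are made up for illustration, and the
helper assumes well-formed UTF-8: it simply counts every byte that is
not a 10xxxxxx continuation byte.

```python
# Mostly ASCII text plus U+00E9 and U+20AC: 11 characters in total.
text = "Firebird \u00e9\u20ac"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# No null bytes in the UTF-8 form, but plenty in UTF-16.
utf8_has_null = 0 in utf8
utf16_has_null = 0 in utf16_le

# UTF-16 bytes depend on byte order; UTF-8 has only one form.
endian_matters = utf16_le != utf16_be

# Compact for mostly-ASCII text: 14 bytes versus 22.
sizes = (len(utf8), len(utf16_le))


def count_chars(data: bytes) -> int:
    """Count characters by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if b & 0xC0 != 0x80)


# The byte length (14) is not the character count (11); you have to
# look at the data to find out.
n_chars = count_chars(utf8)
```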

--
Brad Pepers
brad@...