Subject Re: [firebird-support] Re: Firebird and Unicode queries
Author Olivier Mascia
On 11 Feb 2005, at 11:40, fxam wrote:

> This is what I have found:
> -there's a src\intl\cv_unicode_fss.c in the 1.5.2.4731 source code, go
> to line 72 and you'll see
> "File System Safe Universal Character Set Transformation Format
> (FSS-UTF)"
> -Glossary in Unicode.org (http://www.unicode.org/glossary/#FSS_UTF)
> says that FSS-UTF is now known as UTF-8.
> -What does File System Safe mean? I guess it's because FSS (ie UTF-8)
> is interpreted as a sequence of bytes, and there is no endian problem.

This leaves open the question of what exactly UNICODE_FSS is in
Firebird parlance. I think only the source code will give the
definitive answer. UNICODE_FSS cannot be what UTF-8 is today. If it
were, it would use a VARIABLE number of bytes for each character (from
1 to 4). Everything I have read says that FB UNICODE_FSS uses THREE
bytes per character.
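
To illustrate the "1 to 4 bytes" point, here is a minimal Python sketch that measures the UTF-8 length of a few sample characters. The helper name utf8_len is made up for this example, and the code says nothing about what Firebird itself does internally.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character takes."""
    return len(ch.encode("utf-8"))

# Sample characters chosen for illustration, one per UTF-8 length class.
samples = [
    ("A", "U+0041 LATIN CAPITAL LETTER A"),
    ("\u00e9", "U+00E9 LATIN SMALL LETTER E WITH ACUTE"),
    ("\u20ac", "U+20AC EURO SIGN"),
    ("\U0001D11E", "U+1D11E MUSICAL SYMBOL G CLEF"),
]

for ch, name in samples:
    print(f"{name}: {utf8_len(ch)} byte(s)")
```

Only characters outside the Basic Multilingual Plane (above U+FFFF) need the fourth byte.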

So what is UNICODE_FSS ?

Is it a variable-length scheme identical to UTF-8 except that it can
use at most 3 bytes per character (a kind of truncated UTF-8), and so
cannot represent the full Unicode code space?

Or is it a fixed-length scheme, strictly equivalent to UTF-32, in
which the (unused) fourth byte is not stored?

I suspect and hope the latter is what UNICODE_FSS really is.
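
To make the two hypotheses concrete, here is a rough Python sketch of what each one would produce for a single character. Both function names are hypothetical, the big-endian byte order in the second one is purely my assumption, and neither is taken from the Firebird source.

```python
def truncated_utf8(ch: str) -> bytes:
    """Hypothesis 1: plain UTF-8, but sequences longer than 3 bytes are rejected."""
    encoded = ch.encode("utf-8")
    if len(encoded) > 3:
        raise ValueError(f"U+{ord(ch):04X} needs {len(encoded)} bytes in UTF-8")
    return encoded

def utf32_without_top_byte(ch: str) -> bytes:
    """Hypothesis 2: the code point stored in exactly 3 bytes (big-endian assumed)."""
    return ord(ch).to_bytes(3, "big")

for ch in ("A", "\u20ac"):
    print(f"U+{ord(ch):04X}",
          truncated_utf8(ch).hex(),
          utf32_without_top_byte(ch).hex())
```

Under the first hypothesis "A" occupies one byte and characters above U+FFFF cannot be stored at all. Under the second, every character occupies three bytes and the whole code space remains reachable.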

It is just a bit sad that, to answer a question such as what exactly
UNICODE_FSS is, one has to read the source code... ;-)

If my guess above about what UNICODE_FSS is turns out to be right, I
think it would deserve an alias name in a later Firebird version. That
alias should be UTF_32 or UTF32, which would make things much clearer
for everybody. A simple note could state that this UTF_32 actually
stores only 3 bytes and not 4 because, per the Unicode 4.0 standard,
the fourth byte is always 0 (not significant).
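
The claim about the always-zero byte can be checked quickly in Python: the highest Unicode code point is U+10FFFF, which fits in 21 bits, so the most significant byte of any UTF-32 code unit is zero. This only checks the arithmetic. It does not show how Firebird stores anything.

```python
import sys

# Highest code point in the Unicode code space (U+10FFFF).
MAX_CODE_POINT = sys.maxunicode

# Number of bits needed to represent it, and the value of the top byte
# when it is written as a 32-bit integer.
bits_needed = MAX_CODE_POINT.bit_length()
top_byte = MAX_CODE_POINT >> 24

# The same check through the UTF-32 codec (big-endian, no BOM).
encoded = chr(MAX_CODE_POINT).encode("utf-32-be")

print(hex(MAX_CODE_POINT), bits_needed, top_byte, encoded.hex())
```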

--
Olivier Mascia