Subject Re: [Firebird-Architect] Re: UTF-8 vs UTF-16
Author Nickolay Samofatov
Hello, Peter !

This is a mostly theoretical answer. The practical part will be in the next

>> Forgot to ask. What encoding are you really going to implement ?
>> Currently engine implements UNICODE_FSS, not UTF8.
>> You propose UCS2, not UTF16, right ?

> I propose UTF16BE encoding of the UNICODE subset
> U+0000..U+FFFF, i.e. forget the astral planes, they
> would only give us trouble.

Various OSes and products are moving to full Unicode standard
conformance and are thus starting to support unusual codepoints. But
the Unicode standard doesn't require an implementation to support all
defined codepoints.

>> I want to remind you that UTF8 character may occupy 1-6 bytes.

> I suspect this is clarified to 1-4 bytes in the most recent
> standards.

True. In recent versions they clarified this.
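
For illustration, here is a minimal sketch of the 1-4 byte rule, assuming
the current definition in which UTF-8 is restricted to U+0000..U+10FFFF
(the older definition allowed 5- and 6-byte forms for values up to 31
bits). The function name utf8_length is hypothetical and is not part of
the engine:

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for one code point (restricted 1-4 byte form)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if cp < 0x80:        # 7 bits of payload
        return 1
    if cp < 0x800:       # 11 bits of payload
        return 2
    if cp < 0x10000:     # 16 bits of payload: the whole BMP fits here
        return 3
    return 4             # 21 bits of payload: the astral planes


# Cross-check against Python's own encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

Note that every codepoint in U+0000..U+FFFF needs at most 3 bytes; the
4-byte form is only needed beyond that.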

>> UTF16 character may occupy 1-3 two-byte words.

> 1 or 2: either a single 16bit word which must not be
> from the Surrogate Area U+D800..U+DFFF or a
> High-Surrogate followed by a Low-Surrogate

True. My memories come from several years ago when I researched
this subject. Look at RFC 2044 for example:
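
As a sketch of the 1-or-2 word rule quoted above (written for this
message, not taken from the RFC), the following splits one codepoint
into UTF-16 code units. The name utf16_units is hypothetical:

```python
def utf16_units(cp: int) -> list:
    """Return the one or two 16-bit code units that encode cp in UTF-16."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # BMP codepoint: a single word, never inside the surrogate area.
        return [cp]
    v = cp - 0x10000                 # 20 bits remain
    high = 0xD800 | (v >> 10)        # high surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)       # low surrogate carries the bottom 10 bits
    return [high, low]


# Cross-check against Python's UTF-16BE encoder.
for cp in (0x20AC, 0x1D11E):
    raw = chr(cp).encode("utf-16-be")
    words = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    assert utf16_units(cp) == words
```

There is no case that produces three words, which matches the
correction above.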

>> Both UNICODE_FSS and UCS2 are now obsolete. So fixing MBCS support
>> inside the firebird engine is very important.

> I assume UNICODE_FSS (which never quite existed under this
> name, according to my sources) is equivalent to the UTF-8 encoding
> of the UNICODE subset U+0000..U+FFFF. Otherwise, please
> enlighten me on this issue.

It existed under the name Unicode FSS/UTF (around 1994), and later the
FSS prefix was dropped.
FSS/UTF == File System Safe UCS Transformation Format.
But at the time this name was used in Firebird, all codepoints were
16 bits.
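
To make the consequence of the 16-bit limit concrete, here is a small
sketch of worst-case buffer sizing for a field of n characters stored
as UTF-8. The helper max_utf8_bytes is hypothetical and only
illustrates the arithmetic; it is not the engine's actual code:

```python
def max_utf8_bytes(n_chars: int, bmp_only: bool) -> int:
    """Worst-case UTF-8 storage for n_chars characters.

    With codepoints limited to 16 bits (U+0000..U+FFFF) the longest
    sequence is 3 bytes; with the full range up to U+10FFFF it is 4.
    """
    return n_chars * (3 if bmp_only else 4)


# The largest codepoint of each range hits the worst case exactly.
assert len(chr(0xFFFF).encode("utf-8")) == max_utf8_bytes(1, bmp_only=True)
assert len(chr(0x10FFFF).encode("utf-8")) == max_utf8_bytes(1, bmp_only=False)
```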

> In summary, I would see it as more useful to support
> more defined subsets of Unicode than to extend the support
> to the astral planes, which would cause trouble in
> a lot of other tools and computer languages.

There should be no trouble if we fix MBCS support in the engine.

> Peter Jacobi

Nickolay Samofatov