Subject RE: [Firebird-Architect] Re: UTF-8 vs UTF-16
Author David Schnepper
> -----Original Message-----
> From: peter_jacobi.rm [mailto:peter_jacobi@...]
> Sent: Friday, August 15, 2003 2:21 PM
> To: Firebird-Architect@yahoogroups.com
> Subject: [Firebird-Architect] Re: UTF-8 vs UTF-16
>
>
> Hi Nickolay,
>
> > > 3) prohibiting invalid UTF-8 sequences in UNICODE_FSS cols
> > [...] Point 3 is also possibly implemented, but not always enforced
> > (at least INTL API have enough functionality).
>
> My copy of the "InterBase Collation Kit" doc from David
> Schnepper states:
>
> <cite>
> charset_well_formed
> Not used. This was intended to be a pointer to a function that would
> validate that a string was well formed by the rules of a character set
>
> In his examples this is always NULL. Do you mean that it is in fact
> called and can usefully be defined by fbintl*.dll?
> </cite>
>

As far as I know, it isn't actually called, though several
character sets do set the value (in preparation for a future
version that would use it).
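
To make concrete what a well-formedness hook would have to do
for a UTF-8 style charset, here is a minimal sketch in Python.
The function name is hypothetical and this is not the INTL API
signature; it also applies the modern UTF-8 rules of Python's
strict decoder, which may not match exactly what UNICODE_FSS
allows.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8.

    Hypothetical helper for illustration only. Python's strict
    decoder rejects truncated sequences, overlong encodings and
    encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed_utf8("héllo".encode("utf-8")))  # valid text
    print(is_well_formed_utf8(b"\xc0\xaf"))              # overlong "/"
    print(is_well_formed_utf8(b"\xe2\x82"))              # truncated
```

A column-level check along these lines would have to run on
every store, which is presumably why it matters whether the
engine actually calls the hook.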


> > BTW, don't you remember that Firebird already
> > implements UCS2 charset under name UNICODE in standard
> > fbintl.dll ? It should already have all problems
> > including efficient on-page data compression solved.
>
> Sorry, I've been looking at FB for less than four months.
> From my incomplete knowledge I assumed it is not meant for
> use as a database storage charset, as
> - it is not made accessible by rdb$character_sets
> - there is a warning somewhere about sensitivity
> to endianness.
>

Actually, efficient on-page compression was not
a solved problem -- it was going to be left to
a future revision. Page compression was based
on a byte-based RLE. To efficiently compress
Unicode strings you need either a word-based
RLE or a special encoding for "the space
character" that would compress via byte RLE.
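
A rough illustration of the problem, as a Python sketch rather
than the engine's actual page compression code: it only counts
runs and does not reproduce the on-disk RLE format. The helper
names and the CHAR(10) sample value are made up for the example.
In UCS2 a padding space is the byte pair 00 20, so a byte-based
RLE never sees two equal bytes in a row, while a word-based RLE
sees one long run.

```python
def rle_runs(units):
    """Collapse a sequence into a list of (value, count) runs."""
    runs = []
    for u in units:
        if runs and runs[-1][0] == u:
            runs[-1][1] += 1
        else:
            runs.append([u, 1])
    return [(value, count) for value, count in runs]


def to_words(data: bytes):
    """Split a byte string into 2-byte units."""
    return [data[i:i + 2] for i in range(0, len(data), 2)]


# "AB" padded with spaces to 10 characters, stored big-endian.
ucs2 = "AB".ljust(10).encode("utf-16-be")

byte_runs = rle_runs(ucs2)            # iterating bytes yields ints
word_runs = rle_runs(to_words(ucs2))  # iterating 16-bit units

if __name__ == "__main__":
    print(len(ucs2), "bytes stored")
    print(len(byte_runs), "byte runs, longest",
          max(n for _, n in byte_runs))
    print(len(word_runs), "word runs, longest",
          max(n for _, n in word_runs))
```

For this value the byte view is 20 runs of length 1, so there is
nothing to compress, while the word view is 3 runs with the
trailing spaces collapsed into a single run of 8.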

Dave