firebird-support - Re: [firebird-support] Storing Delphi 2009 "UnicodeString" into database, UTF8?

Subject	Re: [firebird-support] Storing Delphi 2009 "UnicodeString" into database, UTF8?
Author	Stefan Heymann
Post date	2009-05-20T12:28:04Z

>> From Delphi 2009, I'm getting "WideChar" into the driver, is that sufficient
>> to store UnicodeString?

> UnicodeString is UTF-16, viz., 16-bytes squashed into 2-byte
> munchkins, known as "surrogate pairs". It's surrogate pairs you're
> getting into those 2-byte widechars.

That's wrong. Unicode code points by definition range from 0 (zero)
to 0x10FFFF (1,114,111), which makes 1,114,112 possible values (~one
million). Only a fraction of this range is currently allocated. The
most important characters reside in the range from 0 to 0xFFFF.
Anyway, 16 bits are not enough: you need 21 bits, which in practice
means a 32-bit (!) unit per character, to correctly store every
character in one storage entity.
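
The arithmetic can be checked with a few lines of code. This is only a
sketch, written in Python rather than Delphi because it is easy to run
anywhere; the variable names are made up for the illustration. It shows
that the top code point does not fit into 16 bits and that 21 bits are
the strict minimum, with 32 being the next convenient machine size.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines

# Zero is a valid code point, so the count is one more than the maximum.
code_point_count = MAX_CODE_POINT + 1

# Minimum number of bits needed to represent the highest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Does the whole range fit into one 16-bit WideChar?
fits_in_16_bits = MAX_CODE_POINT <= 0xFFFF

print(code_point_count, bits_needed, fits_in_16_bits)
```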

As this would be a bad signal:noise ratio (for most text the upper
bits would always be zero), people have devised different ways to
store Unicode scalar values in byte streams. One of them (UTF-8) uses
from 1 to 4 contiguous 8-bit bytes per Unicode scalar.
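
To make the "1 to 4 bytes" concrete, here is a minimal sketch (Python,
not Delphi; the sample characters and the helper name utf8_len are
chosen purely for illustration) that counts the UTF-8 bytes for one
character from each length class.

```python
# Illustrative sample characters, one per UTF-8 length class:
# U+0041 (A), U+00E4 (a-umlaut), U+20AC (euro sign), U+1D11E (beyond the BMP)
samples = ["\u0041", "\u00e4", "\u20ac", "\U0001D11E"]


def utf8_len(ch):
    """Return the number of 8-bit bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


utf8_lengths = [utf8_len(ch) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print("U+%04X needs %d byte(s) in UTF-8" % (ord(ch), n))
```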

UCS-2 and UTF-16 use 16-bit words. As UCS-2 was unable to store
characters beyond the Basic Multilingual Plane (BMP, 0000..FFFF),
UTF-16 was invented, which needs 2 contiguous 16-bit words to store
those characters beyond the BMP; everything inside the BMP still takes
a single word. These two-word sequences, and only these, are called
"Surrogate Pairs".

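How such a pair is built can be sketched as follows. This is Python,
not Delphi, and the function name to_surrogate_pair is hypothetical;
the constants 0x10000, 0xD800 and 0xDC00 are the ones defined for
UTF-16. The result is compared against what Python's own UTF-16 codec
produces.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into (high, low) UTF-16 words."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


# U+1D11E lies beyond the BMP, so it takes two 16-bit words.
high, low = to_surrogate_pair(0x1D11E)

# Cross-check against the built-in codec (big-endian, no byte order mark).
encoded = "\U0001D11E".encode("utf-16-be")
pair_from_codec = (int.from_bytes(encoded[0:2], "big"),
                   int.from_bytes(encoded[2:4], "big"))

print("%04X %04X" % (high, low))
```

A character inside the BMP, by contrast, encodes to exactly one 16-bit
word and involves no surrogates at all.
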
UCS-4 and UTF-32, you guessed it, are 32-bit representations of
Unicode strings.

> It is not UTF-8. D2009 has an AnsiString variant called
> UTF8String...but I have no idea how it maps to strings that Firebird
> could transliterate to UTF8.

These *are* UTF-8 strings. So you can directly use them with Firebird
UTF8 (and UNICODE_FSS, as I understand it).

> As far as I can find out (so far), in order to get them into UTF8
> encoding, you'll need to convert them to a compatible WIN or ISO
> charset client-side and let the client lc_ctype and db-side
> column-defined charsets take care of transliterating them to UTF8
> encoding.

Oh please! Never! You can directly, and without loss, convert UTF-8,
UCS-2, UTF-16, UCS-4, UTF-32 and every other Unicode representation
that mankind invents (ever heard about UTF-7 or Punycode?) into each
other. Converting text to any other character set is likely to lose
something.
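
A quick way to see the difference, sketched in Python rather than
Delphi (the sample text is arbitrary, and ISO-8859-1 merely stands in
for "a compatible WIN or ISO charset"): round-tripping through several
Unicode encodings gives back the identical text, while a detour
through a single-byte charset replaces whatever it cannot represent.

```python
# Illustrative text: German, Cyrillic, the euro sign, and U+1D11E.
text = "Gr\u00fc\u00dfe, \u0416\u0443\u043a, \u20ac, \U0001D11E"

# Unicode representations convert into each other without loss.
round_trip = text
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    round_trip = round_trip.encode(encoding).decode(encoding)
lossless = (round_trip == text)

# A single-byte charset cannot hold everything; with errors="replace"
# each character it cannot map is turned into "?".
via_latin1 = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
lossy = (via_latin1 != text)

print(lossless, lossy)
print(via_latin1)
```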

> (That doesn't mean there's not a simpler solution out there: the TNT
> guys cracked UTF-8 for older Delphi years ago, after all.) Can't
> help the impression that Codegear jumped out of the frying-pan into
> the fire and/or the current situation reflects a W-in-P rather than
> a done deed.

Delphi 2009 is only a half-baked solution, IMHO. First of all, there
is no way to get my old applications compiled with AnsiString. So it's
the first Turbo Pascal ever that breaks my code (shame!).

Then there's very weak handling of UTF-16. There is a difference
between the length of a string
- in bytes
- in Unicode characters
- in UTF-16 words
Depending on your string these can be 3 different numbers. But Delphi
doesn't care - I, as the application programmer, must care (shame!).
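
The three numbers are easy to demonstrate. The sketch below is Python,
not Delphi, and the sample string is made up; it encodes the text as
UTF-16 and counts bytes, 16-bit words and Unicode characters
separately. As far as I know, the count of 16-bit words is what
Length() reports for a Delphi 2009 UnicodeString.

```python
# Illustrative string: 'a', a-umlaut, and U+1D11E (beyond the BMP).
text = "a\u00e4\U0001D11E"

utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

length_in_bytes = len(utf16_bytes)       # storage size as UTF-16
length_in_words = len(utf16_bytes) // 2  # number of 16-bit words
length_in_chars = len(text)              # Python counts code points here

print(length_in_bytes, length_in_words, length_in_chars)
```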

>>I'm having trouble finding this stuff in the Delphi Help :-/

> Are you sure they actually know what to tell you? ;-)

I guess that the majority of Delphi developers (not the people that
develop *with* Delphi, but the developers that develop Delphi
*itself*) are native English speakers. For them, the whole problem
with different characters and character sets is far, far away. And
that's probably why the tools that handle Character Set issues often
come out ... well ... mediocre.

Best Regards

Stefan