ibobjects - RE: [IBO] UTF-8 handling

Subject	RE: [IBO] UTF-8 handling
Author	Jason Wharton
Post date	2007-10-30T21:53:05Z

> > If I don't use transliteration from the UTF-8 that comes from the
> > client then there won't be any transliteration when the data is
> > presented to the user in the local character set.
>
> Yes. A user that uses UTF8 as the client character set IMHO *expects*
> to get UTF8 from AsString. The same way he expects to get Windows-1252
> strings when he selects WIN1252 as the Client character set.

I can understand your reasoning. However, please consider that all of the
native IBO controls expect the string data to be transliterated into the
local character set because they do not inherently know how to deal with
UTF-8 directly. Thus, by default, I do the transliteration to the local
character set.
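
To make the trade-off concrete, here is a minimal sketch of what a default
transliteration to a single-byte local character set does to mixed-language
text. It is written in Python purely for illustration; the function name
transliterate is hypothetical, cp1252 is assumed as the local character set,
and Python's codec only stands in for the Delphi routine, whose exact
replacement behavior may differ.

```python
def transliterate(utf8_bytes, local_charset="cp1252"):
    """Convert UTF-8 bytes to a single-byte local character set.

    Characters the local character set cannot represent are replaced
    with "?", so the conversion is lossy for text outside that set.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(local_charset, errors="replace")


# German and Russian in one string, as a multi-language app might store it.
utf8_data = "Grüße Привет".encode("utf-8")

local = transliterate(utf8_data)

# The German letters exist in cp1252 and survive; the Cyrillic ones do not
# and come back as "?".
print(local.decode("cp1252"))

# Converting back to UTF-8 cannot restore the replaced characters.
round_trip = local.decode("cp1252").encode("utf-8")
print(round_trip == utf8_data)
```

Text that fits entirely inside the local character set passes through
unchanged, which is why the default works for the native controls.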

> > So, by default I am invoking the Delphi routine to do that
> > transliteration from UTF-8 to the local character set automatically.
>
> I want to write a multi-language application. An application that can
> deal with strings from English, German, Czech, Russian, Hebrew, etc.
> For that, the normal Delphi VCL controls don't fit anyway, I have to
> use Unicode aware controls like TNTWare Controls. So I set the Client
> Character Set to UTF8 to be able to get full Unicode. The "local
> character set" is meaningless in such an application.
>
> When I get UTF-8 from AsString, I translate it to UTF-16 (WideString)
> and pass it on to the TNT controls (and vice versa). I want to store
> UTF-8 strings inside my application because my Object/Relational
> mapper code uses Strings and not WideStrings.
>
> > How is this a problem for you and please help me get a better grip
> > on what exactly you propose to do with the raw UTF-8 character data.
>
> UTF-8 is not raw. It is a well-defined Unicode Transformation format.
> I know what it is, I can deal with it and I want to deal with it.
> Everything else would mean processor cycles to translate it to
> something else, thereby maybe losing information.
>
> Please correct me if I am wrong: I specify a Client Character Set in
> the CharSet property of my IB_Connection. This is the character set
> that is then used to interface with the Client Library (fbclient.dll).
> The Client Library will transliterate everything that comes from the
> database to that Client Character Set and will transliterate
> everything that comes in for storage from the Client Character Set to
> the character set of the specific column.
>
> +------------+           +--------------+       +-----+  +-------+
> | FB Service |--Network--| fbclient.dll |--API--| IBO |--| MyApp |
> +------------+           +--------------+       +-----+  +-------+
>
> When I use FieldByName ('xy').AsString, IBO will usually deliver
> whatever it gets from the Client Library (Trimming rules and
> OnGetString applied before). So when I specify WIN1252 as the Client
> Character Set, then AsString will deliver a Windows 1252 string,
> because that's what it got from fbclient.dll. Even when the field is
> stored as UTF8 in the database.
>
> Is that all correct? Or do I have a misunderstanding here?

I believe we are both correct and we just need to make IBO work for both of
our interests.

I could create a little piece of code that checks the CharSet property when
a connection is made and looks for a _RAW suffix (e.g. UTF-8_RAW). It would
remove the _RAW before passing the name to the server and tell IBO not to do
any default transliteration. Thus, you would get the raw UTF-8 character
data in the AsString property.
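
The suffix check described above could look roughly like the following
sketch. It is in Python for brevity, not Delphi, and nothing in it is
existing IBO code: split_charset, RAW_SUFFIX and the returned flag are
hypothetical names, and treating the suffix as case-insensitive is an
assumption.

```python
from typing import Tuple

RAW_SUFFIX = "_RAW"  # hypothetical marker; not an existing IBO constant


def split_charset(charset: str) -> Tuple[str, bool]:
    """Split a CharSet value into (server_name, transliterate).

    server_name   is the value with any _RAW suffix removed, i.e. what
                  would be passed on to the server.
    transliterate is False when the suffix was present, meaning the
                  default conversion to the local character set is
                  skipped and AsString returns the bytes untouched.
    """
    name = charset.strip()
    # Assumption: the suffix is matched case-insensitively.
    if name.upper().endswith(RAW_SUFFIX):
        return name[: -len(RAW_SUFFIX)], False
    return name, True


print(split_charset("UTF-8_RAW"))
print(split_charset("WIN1252"))
```

A value without the suffix keeps today's behavior, so existing
applications would not be affected.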

Any other ideas?

Jason Wharton