Subject | Re: [IBO] IBO - IB_String vs AnsiString |
---|---|
Author | m. Th. |
Post date | 2009-04-09T12:32:26Z |
Helen Borrie wrote:
>> we can issue and use queries like
>>
>> Select * from t1 where f1 = 'αᾶὰἀᾲᾄᾳᾆ'; /* Unicode characters. I hope
>> that my newsreader will save them correctly */
>>
>
> It didn't; but I'm guessing they are Greek characters? And, yes, provided you would have set the character set of your controls (Font.Charset) correctly for the user input, I don't see why not.
>

Aha! Now I understand your assumptions. :-) Let's clarify a bit. I'll
start with some known things. I'll post them here just to make
things clear.

Our story starts with ASCII, which uses a 7-bit encoding system to
represent 128 different characters. While ASCII was certainly a
foundation (its basic set of 128 characters is still part of
the core of Unicode), it was soon superseded by extended versions that
used the 8th bit to add another 128 characters to the set. Now the
problem is that, with so many languages around the world, there was no
simple way to figure out which other characters to include in the set
(at times indicated as ASCII-8). To cut a long story short, Windows adopted
a different set of characters, called a code page, with a set of
characters depending on your locale configuration and version of
Windows. Besides Windows code pages, there are many other standards based
on a similar paging approach. I don't think I really need to tell you
how messy the situation is with the various ISO 8859 encodings (there
are 15 of them, still unable to cover the more complex alphabets),
Windows code pages, and multi-byte representations to cover Chinese and
other languages. With Unicode, this is all behind us, even though the
new standard has its own complexity and potential problems.

Till D2009, Delphi used only the Windows code page approach introduced
in Win95, exposed as the Font.Charset property of the UI controls in order
to control the encoding or, in more 'human' words, to choose which set of
characters is displayed. For this, the old 'String' type (let's call it
'AnsiString' from now on), which had only one byte per character, was enough.

But as I stated above, this paging approach isn't sufficient to display
all the characters (yes, my query had in the 'Where' clause some letters
from the _polytonic_ Greek alphabet, which isn't included in the
WIN-1253 code page), and with Windows code pages one cannot display
characters from more than one code page _simultaneously_ (obviously).

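
To make the limitation concrete, here is a minimal sketch in Python (not Delphi). Python's `cp1253` and `cp1251` codecs stand in for the Windows Greek and Cyrillic code pages, and `fits_code_page` is a name made up for this illustration. It only checks whether every character of a string exists in a given single-byte code page.

```python
def fits_code_page(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)  # strict mode: raises on unmappable characters
        return True
    except UnicodeEncodeError:
        return False


# Monotonic alpha (U+03B1) is part of Windows-1253.
print(fits_code_page("\u03b1", "cp1253"))   # True
# Alpha with perispomeni (U+1FB6, polytonic Greek) is not.
print(fits_code_page("\u1fb6", "cp1253"))   # False
# Greek alpha plus Cyrillic zhe fits neither the Greek nor the Cyrillic page.
mixed = "\u03b1\u0436"
print(fits_code_page(mixed, "cp1253"), fits_code_page(mixed, "cp1251"))  # False False
```
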
So here we are: Unicode is the name of an international character set,
encompassing the symbols of all written alphabets of the world, of today
and of the past, plus a few more (technical symbols, punctuation etc.).
The Unicode standard (kept in sync with “ISO/IEC 10646”) is defined
and documented by the Unicode Consortium, and contains over 100,000
characters.

To support this, the new string type in Delphi 2009 is String =
UnicodeString, where UnicodeString uses two bytes per character and is
based on UTF-16. (We won't discuss here how 100,000 characters can fit in
a two-byte-per-character string; hint: UTF-16 is actually a variable-length
encoding.) In addition, D2009 supports other string types such as
UTF8String and RawByteString. Besides that, one can create custom string
types.

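
As a small illustration of the "variable length" hint, the following Python sketch (not Delphi; `utf16_byte_count` is a hypothetical helper) counts the bytes a string takes up in UTF-16. Characters in the Basic Multilingual Plane take one 16-bit code unit, while characters beyond it take two (a surrogate pair).

```python
def utf16_byte_count(text: str) -> int:
    """Number of bytes text occupies in UTF-16 (little-endian, no BOM)."""
    return len(text.encode("utf-16-le"))


print(utf16_byte_count("\u03b1"))      # 2: Greek alpha, one 16-bit code unit
print(utf16_byte_count("\u1fb6"))      # 2: polytonic alpha, still one code unit
print(utf16_byte_count("\U0001d11e"))  # 4: musical G clef, a surrogate pair
```
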
So, for certain situations, the old Windows code page system isn't
enough. To overcome this, all the properties/functions must be
declared as 'String'. Now, you'll tell me that they already are.
Yes, of course. But Jason (and, in fact, many other
developers) used, behind the scenes, some tricks which assumed that a
'string' is an array of bytes (see the 'FillChar' example in one of my
previous posts), and these tricks don't work anymore.

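
The 'FillChar' code itself isn't reproduced here. As a rough analogy only (Python, not Delphi; both helper names are hypothetical, and the two-byte size is an assumption matching a UTF-16 code unit), this sketch shows what happens when code that equates "number of characters" with "number of bytes" meets a two-byte-per-character buffer: only half of it gets cleared.

```python
BYTES_PER_CHAR = 2  # assumed size of one UTF-16 code unit


def clear_assuming_one_byte_per_char(buf: bytearray, char_count: int) -> None:
    """Old habit: treat the character count as the byte count."""
    buf[:char_count] = bytes(char_count)


def clear_whole_buffer(buf: bytearray, char_count: int) -> None:
    """Scale the character count by the size of one character."""
    n = char_count * BYTES_PER_CHAR
    buf[:n] = bytes(n)


text = "abcd"
stale = bytearray(text.encode("utf-16-le"))  # 8 bytes for 4 characters
clear_assuming_one_byte_per_char(stale, len(text))
print(repr(stale.decode("utf-16-le")))       # '\x00\x00cd': "cd" survives

clean = bytearray(text.encode("utf-16-le"))
clear_whole_buffer(clean, len(text))
print(clean == bytes(8))                     # True: everything zeroed
```
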
A quick (and dirty) fix would be to change the declarations, where
possible, to 'AnsiString' (which is what Jason did in 4.9 Alpha), but this
gives up the Unicode advantage. That can be very problematic, especially
from a user-experience POV (imagine a user entering in the 'Search' box
some Unicode characters which happen _not_ to map to
the current Windows code page, and the program showing an icy "Not Found" or
throwing an exception).

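
A hedged sketch of that "Not Found" scenario, again in Python rather than Delphi: `to_code_page_lossy` is a hypothetical helper that mimics squeezing a Unicode string through a single-byte code page with '?' substituted for anything unmappable. The real Windows conversion may substitute differently, so treat this only as an illustration of why the search stops matching.

```python
def to_code_page_lossy(text: str, code_page: str = "cp1253") -> str:
    """Round-trip text through a code page, replacing unmappable characters with '?'."""
    return text.encode(code_page, errors="replace").decode(code_page)


stored = "\u03b1\u1fb6"             # value in the database: alpha + polytonic alpha
typed = to_code_page_lossy(stored)  # what is left after the narrowing conversion
print(typed)                        # α?
print(typed == stored)              # False: the search can no longer match
```
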
>> Can you expand a bit?
>>
>
> No: I don't know what "expansion" there might be. Above all, I'm not at all clear about *what* the problem is. I'm supposing (from your clues) that you have your XP or Vista client stations set up for Unicode. I'm supposing too (being unable to test it for myself) that you have your controls set up with Font.Charset = GREEK_CHARSET (or something more appropriate, if you have the relevant international module either as an "extra" or as part of your localised Delphi) and that the data coming from your char/varchar fields is well-formed and is passing across a connection with the right Charset property for those data.
>

See my comments above. The problem isn't in Firebird's engine but in
Delphi's former inability to deal with these characters. There wasn't a
'relevant international module' for these 'extra' characters till now.
Now there is one: Unicode. (FTR, we have had Firebird+IBO apps which deal
with multilingual data for quite a few years now - since approx. 2000 IIRC.)
But in order to have Unicode, the EMBT guys were forced to change the
internal representation of the string type (and Char/PChar for that matter).
And any assumption based on the string's internal layout (IOW, breaking
the type's abstraction) is now doomed.

> Look, if these comments aren't relevant to what you are encountering with Delphi 2009 (or indeed, D2007), then please don't feel you have to get distracted by trying to answer these assumptions. Having nothing to test and nothing to test it on, my questions were mere curiosity, trying to grok what the problem actually is. I should say that it is curiosity flavoured with some self-interest, as I'm working on the second edition of "The Book" and the whole INTL thing is in my "unclassified" basket right now. ;-)
>
> Helen

No problem at all. :-) That's why I try to be explicit: in order to help
you (and others) to sort things out in this matter.

HTH,
m. Th.
PS: Some of the material about Unicode above was taken from Marco Cantu's
book "Delphi 2009 Handbook".