Subject | Re: [firebird-support] Storing Delphi 2009 "UnicodeString" into database, UTF8? |
---|---|
Author | Kjell Rilbe |
Post date | 2009-05-20T12:40:22Z |
Stefan Heymann wrote:
> Then there's very weak handling of UTF-16. There is a difference
> between the length of a string
> - in bytes
> - in Unicode characters
> - in UTF-16 words
> Depending on your string these can be 3 different numbers. But Delphi
> doesn't care - I, as the application programmer must care (shame!).

But even "length in Unicode codepoints" is not 1-1 with "length in
graphically rendered characters" because you have diacritics that have
their own codepoints but apply the diacritic to the preceding codepoint,
resulting in a composite character.
For example, the letter Ä can be coded in two ways with Unicode:
1. The codepoint for the letter Ä (U+00C4).
2. The codepoint for the letter A (U+0041) followed by the codepoint for the
combining diacritic "umlaut" (U+0308).
So would you expect your string library to return 1 or 2 for a string
encoded as example 2 above? It's a single character but two codepoints.
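
To make the two encodings concrete, here is a small sketch in Python (used purely for illustration; it is not Delphi code, and the variable names are my own). It assumes a language whose string length counts codepoints, which is what Python's len() does.

```python
import unicodedata

# Precomposed form: one codepoint, U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS.
precomposed = "\u00C4"
# Decomposed form: U+0041 (A) followed by U+0308 COMBINING DIAERESIS.
decomposed = "A\u0308"

# len() counts codepoints, so the two forms have different lengths
# even though both render as a single character.
print(len(precomposed))            # 1
print(len(decomposed))             # 2

# A plain comparison sees two different codepoint sequences.
print(precomposed == decomposed)   # False

# Normalizing to NFC composes the pair into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So a library that counts codepoints would answer 2 for the second form, unless the string is normalized first.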
What Delphi does is to treat the whole thing as a sequence of 16-bit
"things". Only in some situations does your code actually have to go into
further detail, but that approach makes the string library very easy to
use and code compared to what you seem to suggest it should do.
Fwiw, I agree that maybe the string library should at least support all
variants of length calculations:
1. Bytes
2. 16 bit words
3. Codepoints
4. Graphically rendered characters, composites taken into consideration
... anything else? :-)
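
As a rough sketch of what those four variants could look like, here is a Python function (again only for illustration, not a Delphi API; length_variants and its key names are hypothetical). The "rendered characters" figure is only an approximation that skips combining marks; proper grapheme segmentation follows more elaborate rules than this.

```python
import unicodedata

def length_variants(s):
    """Return several notions of 'length' for the string s."""
    utf16 = s.encode("utf-16-le")  # UTF-16 without a byte order mark
    return {
        # 1. Bytes (shown for both UTF-8 and UTF-16 storage)
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(utf16),
        # 2. 16-bit words: each UTF-16 code unit is two bytes
        "utf16_words": len(utf16) // 2,
        # 3. Codepoints
        "codepoints": len(s),
        # 4. Approximation only: count codepoints that are not combining marks
        "rendered_chars_approx": sum(
            1 for ch in s if unicodedata.combining(ch) == 0
        ),
    }

# Decomposed Ä (A + U+0308) followed by U+1D11E MUSICAL SYMBOL G CLEF,
# which lies outside the BMP and needs a surrogate pair in UTF-16.
sample = "A\u0308\U0001D11E"
print(length_variants(sample))
```

For this sample the function gives 7 UTF-8 bytes, 8 UTF-16 bytes, 4 16-bit words, 3 codepoints and approximately 2 rendered characters, so all the numbers really can differ.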
Kjell
--
--------------------------------------
Kjell Rilbe
DataDIA AB
E-mail: kjell@...
Phone: 08-761 06 55
Mobile: 0733-44 24 64