Subject | Re: [Firebird-Architect] Re: The Wolf on Firebird 3
---|---
Author | Brad Pepers
Post date | 2005-11-17T05:54:07Z
I don't know why someone doesn't develop a string class that works with
Unicode characters and handles the normal string operations (copy,
compare, length, substring, convert to/from character sets), and then
use it throughout the code. Internally it really doesn't matter what
Unicode format is used for the strings. All you need is the following
(a rough sketch of such an interface follows the list):
1. The normal constructors and copy methods (ideally using a
copy-on-write or reference-counting scheme, with any issues this might
cause with threads documented).
2. Comparison methods. I don't think any comparison actually makes
sense without a locale. Even equality is affected by the locale you are
in, since some locales treat distinct Unicode characters as equivalent.
Certainly the lexical ordering methods require a locale.
3. Conversion methods. These would allow conversion to/from different
character sets, including the different Unicode formats, so you could
say your data is in UTF-8 or UTF-16 or whatever. They are going to be
required no matter what the internal form is, and the user of the
library will pick whichever interface suits them best. Windows clients
will likely pass data into Firebird as UTF-16; Linux clients will
likely prefer UTF-8 since it's much more common there. In both cases
the client may decide instead to pass a C string in some particular
character set. It doesn't really matter. One of these formats will get
somewhat preferential treatment, since it will match the internal
Unicode format, but I doubt the speed gained will be a big deal.
4. You need to decide on the format for storing strings to disk. It
should be the smallest one, to gain the benefit Jim has mentioned of
spending less time reading from disk. Again, the format stored to disk
doesn't have to match what's used internally by the string class. The
code that stores it just asks the string class for the data in the
on-disk format, and passes the data back into the string class,
specifying that format, when it is read off the disk. I agree with Jim
that UTF-8 makes the best sense here, but it could easily be changed
down the road as long as things were encapsulated nicely.
5. The normal string-type operations: length (in characters),
concatenation, substrings, and so on.
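To make that concrete, here is a declaration-only sketch of what such a
class might look like. Everything in it (FbString, CharSet, Locale, the
method names) is hypothetical, invented here for illustration; it is not
an existing Firebird or ICU interface.

```cpp
#include <cstddef>
#include <string>

enum class CharSet { Utf8, Utf16, Latin1 /* ... */ };

class Locale;  // would carry the collation/equivalence rules for a region

class FbString {
public:
    // 1. Construction and copying. An implementation might share the
    //    underlying buffer (copy-on-write / reference counting), which
    //    needs documented care if strings cross threads.
    FbString();
    FbString(const char* bytes, std::size_t length, CharSet cs);
    FbString(const FbString& other);             // cheap: shares the buffer
    FbString& operator=(const FbString& other);

    // 2. Comparison, always relative to a locale; even equality needs
    //    one, since some locales treat distinct characters as equivalent.
    bool equals(const FbString& other, const Locale& loc) const;
    int  compare(const FbString& other, const Locale& loc) const; // <0,0,>0

    // 3. Conversion to and from any supported character set, so a Windows
    //    client can hand in UTF-16, a Linux client UTF-8, and either can
    //    pass a plain C string in some legacy character set.
    std::string toBytes(CharSet cs) const;
    static FbString fromBytes(const char* bytes, std::size_t len, CharSet cs);

    // 4. Disk storage needs nothing extra: the page-writing code calls
    //    toBytes() with whatever format the on-disk layout picks (say
    //    UTF-8) and fromBytes() when reading the page back.

    // 5. The usual operations, counted in characters rather than bytes.
    std::size_t length() const;
    FbString    substring(std::size_t pos, std::size_t count) const;
    FbString    concat(const FbString& other) const;
};
```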
I'm not 100% sure how indexes work, but I suspect they will have to
have a character set (or a locale, which implies a default character
set?) specified, and the strings in the index will have been converted
to that character set, or something like that? I imagine you want the
index strings ordered in a way that lets you do byte-wise comparisons
for both equality and lexical ordering (though how this works for
languages like Japanese I don't have a clue!).
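For what it's worth, ICU can already produce exactly this kind of
byte-wise comparable key through collation sort keys, and that covers
Japanese too, since the ordering rules get baked into the key bytes. A
minimal sketch, assuming ICU is available; makeIndexKey() is a
hypothetical helper invented here, but the ICU calls are the library's
real API:

```cpp
#include <unicode/coll.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <memory>
#include <vector>

std::vector<uint8_t> makeIndexKey(const icu::UnicodeString& s,
                                  const icu::Collator& coll) {
    // Preflight with an empty buffer to learn the key size, then fill it.
    int32_t n = coll.getSortKey(s, nullptr, 0);
    std::vector<uint8_t> key(n);
    coll.getSortKey(s, key.data(), n);
    return key;  // NUL-terminated; byte order matches the locale's order
}

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale("ja", "JP"), status));
    if (U_FAILURE(status)) return 1;

    // Even for Japanese, the index only ever needs memcmp() on the keys.
    auto k1 = makeIndexKey(icu::UnicodeString::fromUTF8("あさ"), *coll);
    auto k2 = makeIndexKey(icu::UnicodeString::fromUTF8("ゆき"), *coll);
    std::cout << (std::memcmp(k1.data(), k2.data(),
                              std::min(k1.size(), k2.size())) < 0)
              << std::endl;
    return 0;
}
```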
The important thing, at least as far as I can see, is to come up with
the string class and make everything use it. Then if you decide to
change the internal format it should be relatively easy to do, and you
can try the alternatives, compare speeds, and pick the best choice.
Oh, and as a final issue, consideration could be given to using one of
the tens or even hundreds of Unicode-capable string classes already out
there. You might not find one that meets your needs, and I can
understand having very strict requirements on what you want it to do,
but you might also save some time by using what has already been done.
For example, if you decide on using the ICU library from IBM, which
provides all sorts of localization code, it might make sense to use
their Unicode string class. You could then also take advantage of their
locale-aware date/time to/from string conversions (for example, what
should happen if you cast a DATE to a VARCHAR in Spain?) and likewise
their number conversions (should all numbers, when converted to
strings, use "." for the decimal place, or should they use what the
locale expects?).
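For what it's worth, here's a small sketch of the kind of answers ICU
gives to those two questions. The ICU calls are the library's real C++
API; the exact output strings depend on the ICU/CLDR version installed,
so take the examples in the comments as approximate.

```cpp
#include <unicode/calendar.h>
#include <unicode/datefmt.h>
#include <unicode/locid.h>
#include <unicode/numfmt.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    icu::Locale spain("es", "ES");

    // A date cast to a string under es_ES comes out with Spanish month
    // names and day-first ordering, something like "17 nov 2005".
    std::unique_ptr<icu::DateFormat> df(
        icu::DateFormat::createDateInstance(icu::DateFormat::kMedium, spain));
    icu::UnicodeString date;
    df->format(icu::Calendar::getNow(), date);

    // es_ES uses "," as the decimal separator, so 1234.5 formats
    // roughly as "1234,5" rather than "1234.5".
    std::unique_ptr<icu::NumberFormat> nf(
        icu::NumberFormat::createInstance(spain, status));
    if (U_FAILURE(status)) return 1;
    icu::UnicodeString num;
    nf->format(1234.5, num);

    std::string d, n;
    date.toUTF8String(d);
    num.toUTF8String(n);
    std::cout << d << " / " << n << std::endl;
    return 0;
}
```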
--
Brad Pepers
brad@...