Subject | Re: [Firebird-Architect] UTF-8 and Compression |
---|---|
Author | Jim Starkey |
Post date | 2005-03-01T01:16:23Z |
Leyne, Sean wrote:
>Jim,
>
>>I've been thinking about compression and Olivier's stunning suggestion
>>to switch the engine to all utf-8.
>
>Stunning good or stunning bad?
>
Stunningly brilliant (i.e. better than good). We currently have an
n-squared problem of character sets confounded by collating sequences.
Switching to a universal internal representation reduces it to a problem
linear with the number of collating sequences. It breaks the binding
between character sets and collations. A character set becomes a
bidirectional mapping between the character set and UTF-8. A collation
becomes a simple object that compares two UTF-8 strings, generates a
key, upcases, and downcases (what have I neglected here?). In the new
API we can probably isolate character set conversions to the client,
leaving the engine with a single internal representation and collation
sequences. The legacy API will need to support per-SQLVAR character
sets, but the "new API" can probably get away with pure UTF-8. New
layered APIs, the formalization of IscDbc, can be defined with a single
per-session locale, which fits the Java model and simple sanity nicely.
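
As a rough illustration of the split described above, here is a minimal Python sketch. It is not Firebird or IscDbc code: the class names `CharacterSet` and `CaseInsensitiveCollation` and the helper `transcode` are hypothetical, Python's built-in codecs stand in for real conversion tables, and Unicode case folding stands in for a real collation. It only shows the shape of the idea: each character set knows how to get to and from UTF-8, each collation works on UTF-8 alone, and conversion between two legacy character sets pivots through UTF-8 instead of needing a dedicated converter per pair.

```python
import codecs


class CharacterSet:
    """Hypothetical: a character set as a two-way mapping to and from UTF-8."""

    def __init__(self, codec_name: str):
        codecs.lookup(codec_name)  # raises LookupError for an unknown codec
        self.codec_name = codec_name

    def to_utf8(self, raw: bytes) -> bytes:
        return raw.decode(self.codec_name).encode("utf-8")

    def from_utf8(self, utf8: bytes) -> bytes:
        return utf8.decode("utf-8").encode(self.codec_name)


class CaseInsensitiveCollation:
    """Hypothetical collation that only ever sees UTF-8 strings.

    Assumption: Unicode case folding plus code point order is used here
    purely as a placeholder for real, language-specific collation rules.
    """

    def upcase(self, utf8: bytes) -> bytes:
        return utf8.decode("utf-8").upper().encode("utf-8")

    def downcase(self, utf8: bytes) -> bytes:
        return utf8.decode("utf-8").lower().encode("utf-8")

    def key(self, utf8: bytes) -> bytes:
        # Sort key: comparing keys bytewise gives the collation order.
        return utf8.decode("utf-8").casefold().encode("utf-8")

    def compare(self, a: bytes, b: bytes) -> int:
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)  # -1, 0, or 1


def transcode(src: CharacterSet, dst: CharacterSet, raw: bytes) -> bytes:
    """Convert between two character sets by pivoting through UTF-8."""
    return dst.from_utf8(src.to_utf8(raw))
```

With N character sets, this needs N mappings rather than a converter for every pair, and a collation written once against UTF-8 applies regardless of which character set the client used.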
[Non-text portions of this message have been removed]